




ELEMENTARY
FOREST
SAMPLING 


Agriculture Handbook No. 232 


& 


ELEMENTARY 
STATISTICAL METHODS 
FOR FORESTERS 


AGRICULTURE HANDBOOK 317 


U.S. Department of Agriculture 
Forest Service 







ELEMENTARY
FOREST 
SAMPLING 


FRANK FREESE 
Southern Forest Experiment Station, Forest Service 


Agriculture Handbook No. 232 December 1962 


U.S. Department of Agriculture Forest Service 











ACKNOWLEDGMENTS 


I should like to express my appreciation to Professor George W. 
Snedecor of the Iowa State University Statistical Laboratory and 
to the Iowa State University Press for their generous permission 
to reprint tables 1, 3, and 4 from their book Statistical Methods, 
5th edition. Thanks are also due to Dr. C. I. Bliss of the Connecti- 
cut Agricultural Experiment Station, who originally prepared the 
material in table 4. I am indebted to Professor Sir Ronald A. 
Fisher, F.R.S., Cambridge, and to Dr. Frank Yates, F.R.S., Roth- 
amsted, and to Messrs. Oliver and Boyd Ltd., Edinburgh, for 
permission to reprint table 2 from their book Statistical Tables
for Biological, Agricultural, and Medical Research.


FRANK FREESE 
Southern Forest Experiment Station 



















ELEMENTARY FOREST SAMPLING 


This is a statistical cookbook for foresters. It presents some 
sampling methods that have been found useful in forestry. No
attempt is made to go into the theory behind these methods. This 
has some dangers, but experience has shown that few foresters 
will venture into the intricacies of statistical theory until they 
are familiar with some of the common sampling designs and
computations. 

The aim here is to provide that familiarity. Readers who attain 
such familiarity will be able to handle many of the routine sam- 
pling problems. They will also find that many problems have been 
left unanswered and many ramifications of sampling ignored. It 
is hoped that when they reach this stage they will delve into more 
comprehensive works on sampling. Several very good ones are 
listed on page 78. 


BASIC CONCEPTS 


Why Sample? 

Most human decisions are made with incomplete knowledge. In 
daily life, a physician may diagnose disease from a single drop of 
blood or a microscopic section of tissue; a housewife judges a 
watermelon by its “plug” or by the sound it emits when thumped; 
and amid a bewildering array of choices and claims we select 
toothpaste, insurance, vacation spots, mates, and careers with but 
a fragment of the total information necessary or desirable for 
complete understanding. All of these we do with the ardent hope 
that the drop of blood, the melon plug, and the advertising claim 
give a reliable picture of the population they represent. 

In manufacturing and business, in science, and no less in fores- 
try, partial knowledge is a normal state. The complete census is 
rare—the sample is commonplace. A ranger must advertise timber 
sales with estimated volume, estimated grade yield and value, esti- 
mated cost, and estimated risk. The nurseryman sows seed whose 
germination is estimated from a tiny fraction of the seedlot, and 
at harvest he estimates the seedling crop with sample counts in 
the nursery beds. Enterprising pulp companies, seeking a source 
of raw material in sawmill residue, may estimate the potential 
tonnage of chippable material by multiplying reported production 
by a set of conversion factors obtained at a few representative 
sawmills. 

However desirable a complete measurement may seem, there are 
several good reasons why sampling is often preferred. In the first 
place, complete measurement or enumeration may be impossible. 
The nurseryman might be somewhat better informed if he knew
the germinative capacity of all the seed to be sown, but the de-
structive nature of the germination test precludes testing every
seed. For identical reasons, it is impossible to measure the bend- 
ing strength of all the timbers to be used in a bridge, the tearing 
strength of all the paper to be put into a book, or the grade of 
all the boards to be produced in a timber sale. If the tests were 
permitted, no seedlings would be produced, no bridges would be 
built, no books printed, and no stumpage sold. Clearly where test- 
ing is destructive, some sort of sampling is inescapable. 

In other instances total measurement or count is not feasible. 
Consider the staggering task of testing the quality of all the water 
in a reservoir, weighing all the fish in a stream, counting all the 
seedlings in a 500-bed nursery, enumerating all the egg masses in
a turpentine beetle infestation, measuring diameter and height of
all the merchantable trees in a 10,000-acre forest. Obviously, the 
enormity of the task would demand some sort of sampling 
procedure. 

It is well known that sampling will frequently provide the essen- 
tial information at a far lower cost than a complete enumeration. 
Less well known is the fact that this information may at times be 
more reliable than that obtained by a 100-percent inventory. There 
are several reasons why this might be true. With fewer observa- 
tions to be made and more time available, measurement of the 
units in the sample can be and is more likely to be made with greater 
care. In addition, a portion of the saving resulting from sampling 
could be used to buy better instruments and to employ or train 
higher caliber personnel. It is not hard to see that good measure-
ments on 5 percent of the units in a population could provide more
reliable information than sloppy measurements on 100 percent 
of the units. 

Finally, since sample data can be collected and processed in a 
fraction of the time required for a complete inventory, the infor- 
mation obtained may be more timely. Surveying 100 percent of 
the lumber market is not going to provide information that is very 
useful to a seller if it takes 10 months to complete the job. 





Populations, Parameters, and Estimates 


The central notion in any sampling problem is the existence of 
a population. It is helpful to think of a population as an aggregate
of unit values, where the “unit” is the thing upon which the obser- 
vation is made, and the "value" is the property observed on that 
thing. For example, we may imagine a square 40-acre tract of 
timber in which the unit being observed is the individual tree and 
the value being observed is tree height. The population is the 
aggregate of all heights of trees on the specified forty. The diam- 
eters of these same trees would be another population. The cubic 
volumes in some particular portion of the stems constitute still 
another population. 

Alternatively, the units might be defined as the 400 1-chain- 
square plots into which the tract could be divided. The cubic 
volumes of trees on these plots might form one population. The 
board-foot volumes of the same trees would be another population.
The number of earthworms in the top 6 inches of soil on
these plots could be still a third population. 

Whenever possible, matters will be simplified if the units in 
which the population is defined are the same as those to be selected 
in the sample. If we wish to estimate the total weight of earth- 
worms in the top 6 inches of soil for some area, it would be best 
to think of a population made up of blocks of soil of some specified 
dimension with the weight of earthworms in the block being the 
unit value. Such units are easily selected for inclusion in the 
sample, and projection of sample data to the entire population is 
relatively simple. If we think of individual earthworms as the 
units, selection of the sample and expansion from the sample to 
the population may both be very difficult. 

To characterize the population as a whole, we often use certain 
constants that are called parameters. The mean value per plot in
a population of quarter-acre plots is a parameter. The proportion
of living seedlings in a pine plantation is a parameter. The total 
number of units in the population is a parameter, and so is the 
variability among the unit values. 

The objective of sample surveys is usually to estimate some 
parameter or a function of some parameter or parameters. Often, 
but not always, we wish to estimate the population mean or total. 
The value of the parameter as estimated from a sample will here- 
after be referred to as the sample estimate or simply the estimate. 


Bias, Accuracy, and Precision 


In seeking an estimate of some population trait, the sampler's 
fondest hope is that at a reasonable cost he will obtain an estimate 
that is accurate (i.e., close to the true value). Without any help
from sampling theory he knows that if bias rears its insidious 
head, accuracy will flee the scene. And he has a suspicion that 
even though bias is eliminated, his sample estimate may still not 
be entirely precise. When only a part of the population is meas- 
ured, some estimates may be high, some low, some fairly close, 
and unfortunately, some rather far from the true value. 

Though most people have a general notion as to the meaning of 
bias, accuracy, and precision, it might be well at this stage to state 
the statistical interpretation of these terms. 

Bias.—Bias is a systematic distortion. It may be due to some 
flaw in measurement, to the method of selecting the sample, or to 
the technique of estimating the parameter. If, for example, seed- 
ling heights are measured with a ruler from which the first half- 
inch has been removed, all measurements will be one-half inch too 
large and the estimate of mean seedling height will be biased. In 
studies involving plant counts, some observers will nearly always 
include a plant that is on the plot boundary; others will consist- 
ently exclude it. Both routines are sources of measurement bias. In 
timber cruising, the volume table selected or the manner in which 
it is used may result in bias. A table made up from tall timber will
give biased results when used without adjustment on short-bodied 
trees. Similarly, if the cruiser consistently estimates merchantable 
height above or below the specifications of the table, volume so
estimated will be biased. The only practical way to minimize
measurement bias is by continual check of instrumentation, and 
meticulous training and care in the use of instruments. 

Bias due to method of sampling may arise when certain units 
are given a greater or lesser representation in the sample than in 
the population. As an elementary example, assume that we are es- 
timating the survival of 10,000 trees planted in 100 rows of 100 
trees each. If the sample were selected only from the interior 
98 x 98 block of trees in the interest of obtaining a “more repre- 
sentative” picture of survival, bias would occur simply because 
the border trees had no opportunity to appear in the sample. 
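
The size of this exclusion can be worked out directly. The sketch below counts the border trees in the 100 x 100 planting and shows how an interior-only sample would miss the true survival rate. The two survival rates are hypothetical values chosen only for illustration; they do not come from the handbook.

```python
# Planting layout from the text: 100 rows of 100 trees each.
ROWS = COLS = 100


def is_border(row, col):
    """True for a tree in the outermost row or column of the planting."""
    return row in (0, ROWS - 1) or col in (0, COLS - 1)


border = sum(is_border(r, c) for r in range(ROWS) for c in range(COLS))
interior = ROWS * COLS - border

# Hypothetical survival rates, assumed only for this illustration.
P_INTERIOR = 0.80
P_BORDER = 0.60

true_survival = (interior * P_INTERIOR + border * P_BORDER) / (ROWS * COLS)

# A sample drawn only from the interior block can, on the average,
# do no better than reproduce the interior survival rate.
expected_interior_only = P_INTERIOR
bias = expected_interior_only - true_survival

print(f"border trees never sampled: {border}")
print(f"true survival:              {true_survival:.5f}")
print(f"interior-only expectation:  {expected_interior_only:.5f}")
print(f"bias:                       {bias:+.5f}")
```

Under these assumed rates the bias is small, but it does not shrink as the sample grows, which is what separates bias from ordinary sampling variation.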

The technique of estimating the parameter after the sample 
has been taken is also a possible source of bias. If, for example, 
the survival on a planting job is estimated by taking a simple 
arithmetic average of the survival estimates from two fields, the 
resulting average may be seriously biased if one field is 500 acres 
and the other 10 acres in size. A better overall estimate would be 
obtained by weighting the estimates for the two fields in propor- 
tion to the field sizes. Another example of this type of bias occurs 
in the common forestry practice of estimating average diameter 
from the diameter of the tree of mean basal area. The latter pro- 
cedure actually gives the square root of the mean squared diam- 
eter, which is not the same as the arithmetic mean diameter unless 
all trees are exactly the same size. 
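
Both estimation biases can be seen with a few lines of arithmetic. In the sketch below the field sizes (500 and 10 acres) come from the text; the survival figures and tree diameters are hypothetical, and the function names are merely illustrative.

```python
import math


def simple_mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Mean of values with each value weighted, here by field size."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def quadratic_mean(values):
    """Square root of the mean squared value (diameter of the tree of mean basal area)."""
    return math.sqrt(sum(v * v for v in values) / len(values))


# Hypothetical survival estimates for the two fields.
survival = [0.90, 0.50]
acres = [500, 10]

unweighted = simple_mean(survival)
weighted = weighted_mean(survival, acres)

# Hypothetical tree diameters, in inches.
diameters = [6.0, 10.0, 14.0]
arithmetic_diameter = simple_mean(diameters)
quadratic_diameter = quadratic_mean(diameters)

print(f"unweighted survival: {unweighted:.3f}")
print(f"weighted survival:   {weighted:.3f}")
print(f"arithmetic mean diameter: {arithmetic_diameter:.2f}")
print(f"quadratic mean diameter:  {quadratic_diameter:.2f}")
```

With these assumed figures the unweighted average gives the 10-acre field as much say as the 500-acre field, and the quadratic mean diameter exceeds the arithmetic mean, as it must whenever the trees differ in size.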

Bias is seldom desirable, but it is not a cause for panic. It is 
something a sampler may have to live with. Its complete elimina-
tion may be costly in dollars, precision, or both. The important
thing is to recognize the possible sources of bias and to weigh the 
effects against the cost of reducing or eliminating it. Some of the 
procedures discussed in this handbook are known to be slightly 
biased. They are used because the bias is often trivial and because 
they may be more precise than the unbiased procedures. 

Precision and accuracy.—A badly biased estimate may be pre- 
cise but it can never be accurate. Those who find this hard to 
swallow may be thinking of precision as being synonymous with 
accuracy. Statisticians being what they are, it will do little good 
to point out that several lexicographers seem to think the same 
way. Among statisticians accuracy refers to the success of esti-
mating the true value of a quantity; precision refers to the cluster-
ing of sample values about their own average, which, if biased,
cannot be the true value. Accuracy, or closeness to the true value, 
may be absent because of bias, lack of precision, or both. 

A target shooter who puts all of his shots in a quarter-inch
circle in the 10-ring might be considered accurate; his friend who
puts all of his shots in a quarter-inch circle at 12 o'clock in the
6-ring would be considered equally precise but nowhere near as 
accurate. An example for foresters might be a series of careful 
measurements made of a single tree with a vernier caliper, one 
arm of which is not at right angles to the graduated beam. 
Because the measurements have been carefully made they should 
not vary a great deal but should cluster closely about their mean 
value: they will be precise. However, as the caliper is not properly
adjusted the measured values will be off the true value (bias) and
the diameter estimate will be inaccurate. If the caliper is properly 
adjusted but is used carelessly the measurements may be unbiased 
but they will be neither accurate nor precise. 
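
The caliper example can be put in numbers. The sketch below uses two hypothetical sets of repeated measurements of one tree whose true diameter is assumed to be 12.0 inches. Bias is taken as the mean minus the true value, precision as the spread of the measurements about their own mean, and accuracy as the root-mean-square distance from the true value. These particular summary measures are an assumption of this sketch, not something the handbook prescribes.

```python
import math


def mean(values):
    return sum(values) / len(values)


def spread(values):
    """Root-mean-square deviation of the values about their own mean (precision)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def rms_error(values, true_value):
    """Root-mean-square deviation of the values about the true value (accuracy)."""
    return math.sqrt(sum((v - true_value) ** 2 for v in values) / len(values))


TRUE_DIAMETER = 12.0  # hypothetical true value, inches

# Careful measurements with a misadjusted caliper: tight but shifted.
misadjusted = [12.4, 12.5, 12.5, 12.6]
# Careless measurements with a good caliper: centered but scattered.
careless = [11.2, 12.9, 11.6, 12.3]

for label, data in (("misadjusted", misadjusted), ("careless", careless)):
    bias = mean(data) - TRUE_DIAMETER
    print(f"{label:12s} bias={bias:+.3f}  spread={spread(data):.3f}  "
          f"rms error={rms_error(data, TRUE_DIAMETER):.3f}")
```

For either set the squared rms error equals the squared spread plus the squared bias, which is one way of saying that accuracy can be lost through bias, lack of precision, or both.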


Variables, Continuous and Discrete 


Variation is one of the facts of life. It is difficult to say whether 
this is good or bad, but we can say that without it there would be
no sampling problems (or statisticians). How to cope with some
of the sampling problems created by natural variation is the
subject of this handbook.

To understand statisticians it is helpful to know their language, 
and in this language the term variable plays an active part. A
characteristic that may vary from unit to unit is called a variable.
In a population of trees, tree height is a variable, so are tree diam-
eter, number of cones, cubic volume, and form class. As some trees
may be loblolly pine, some slash pine, and some dawn redwoods,
species is also a variable. Presence or absence of insects, the color 
of the foliage, and the fact that the tree is alive or dead are vari- 
ables also. 

A variable that is characterized by being related to some nu-
merical scale of measurement, any interval of which may, if de-
sired, be subdivided into an infinite number of values, is said to
be continuous. Length, height, weight, temperature, and volume 
are examples of variables that can usually be labeled continuous. 
Qualitative variables and those that are represented by integral 
values or ratios of integral values are said to be discrete. Two 
forms of discrete data may be recognized: attributes and counts. 
In the first of these the individual is classified as having or not
having some attribute; or, more commonly, a group of individuals
is described by the proportion or percentage having a particular
attribute. Some familiar examples are the proportion of slash pine 
seedlings infected by rust, the percentage of stocked milacre quad- 
rats, and the survival percentage of planted seedlings. In the 
second form, the individual is described by a count that cannot 
be expressed as a proportion. Number of seedlings on a milacre, 
number of weevils in a cone, number of sprouts on a stump, and
number of female flowers on a tree are common examples. 

A distinction is made between continuous and discrete variables
because the two types of data may require different statistical
procedures. Most of the sampling methods and computational pro-
cedures described in this handbook were developed primarily for
use with continuous variables. The procedures that have been de- 
vised for discrete variables are generally more complex. By in- 
creasing the number of values that a discrete variable can assume, 
however, it is often possible to handle such data by the continuous- 
variable methods. Thus, germination percentages based on 200 or 
more seeds per dish can usually be treated by the same procedures 
that would be used for measurement data. The section that begins 
on page 61 describes simple random sampling with classification 
data and gives some illustrations of how the sampling procedures
for continuous data may be used for classification and count data.




Distribution Functions 


A distribution function shows, for a population, the relative fre-
quency with which different values of a variable occur. Knowing
the distribution function, we can say what proportion of the indi- 
viduals are within certain size limits. 

Each population has its own distinct distribution function. 
There are, however, certain general types of function that occur 
quite frequently. The most common are the normal, binomial, and 
Poisson. The bell-shaped normal distribution, familiar to most 
foresters, is often encountered in dealing with continuous vari-
ables. The binomial is associated with data where a fixed number
of individuals are observed on each unit and the unit is charac- 
terized by the number of individuals having some particular at- 
tribute. The Poisson distribution may arise where individual units 
are characterized by a count having no fixed upper limit, particu- 
larly if zero or very low counts tend to predominate. 

The form of the distribution function dictates the appropriate 
statistical treatment of a set of data. The exact form of the dis- 
tribution will seldom be known, but some indications may be ob- 
tained from the sample data or from a general familiarity with 
the population. The methods of dealing with normally distributed 
data are simpler than most of the methods that have been de- 
veloped for other distributions. 

Fortunately, it has been shown that, regardless of the distribu- 
tion which a variable follows, the means of large samples tend to 
follow a distribution that approaches the normal and may be 
treated by normal distribution methods. 
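
This tendency can be checked by simulation. The sketch below draws repeated samples of 50 from a strongly skewed population (an exponential distribution with mean 2, chosen arbitrarily for illustration) and looks at how the sample means behave. The seed, sample size, and number of replicates are likewise assumptions of the sketch.

```python
import random
import statistics

rng = random.Random(1962)  # fixed seed so the run can be repeated


def draw_unit(generator):
    """One unit value from a skewed population: exponential with mean 2."""
    return generator.expovariate(0.5)


def sample_mean(n, generator):
    """Mean of a simple random sample of n unit values."""
    return statistics.fmean([draw_unit(generator) for _ in range(n)])


SAMPLE_SIZE = 50
REPLICATES = 2000

means = [sample_mean(SAMPLE_SIZE, rng) for _ in range(REPLICATES)]
center = statistics.fmean(means)
sd_of_means = statistics.stdev(means)

# For a normal distribution roughly 95 percent of values lie within
# two standard deviations of the center.
within_two_sd = sum(abs(m - center) <= 2 * sd_of_means for m in means) / REPLICATES

print(f"mean of sample means: {center:.3f}")
print(f"std. dev. of means:   {sd_of_means:.3f}")
print(f"share within 2 s.d.:  {within_two_sd:.3f}")
```

Although the individual unit values are far from normal, the share of sample means falling within two standard deviations of their center should come out close to the 95 percent expected of a normal distribution.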


TOOLS OF THE TRADE 


Subscripts, Summations, and Brackets 


In describing the various sampling methods, frequent use will 
be made of subscripts, brackets, and summation symbols. Some 
beginning samplers will be unhappy about this; others will be 
downright mad. The purpose though, is not to impress or confuse 
the reader. These devices are, like the more familiar notations 
of +, -, and =, merely a concise way of expressing ideas that
would be ponderous if put into conventional language. And like 
the common algebraic symbols, using and understanding them is 
just a matter of practice. 

Subscripts.—The appearance of an $x_i$, $z_{jk}$, or $y_{ilmn}$ brings a frown
of annoyance and confusion to the face of many a forester. Yet
interpreting this notation is quite simple. In $x_i$, the subscript $i$
means that $x$ can take on different forms or values. Putting in a
particular value of $i$ tells which form or value of $x$ we are
concerned with. The $i$ might imply a particular characteristic of an
individual. The term $x_1$ might be the height of the individual, $x_2$
might be his weight, $x_3$ his age, and so forth. Or the subscript
might imply a particular individual. In this case, $x_1$ could be the
height of the first individual, $x_2$ the height of the second, $x_3$ the
height of the third individual, and so forth. Which meaning is
intended will usually be clear from the context.

A variable (say $x$) will often be identified in more than one
way. Thus, we might want to refer to the age of the second
individual or the height of the first individual. This dual classification
is accomplished with two subscripts. In $x_{ik}$ the $i$ might identify
the characteristic (for height, $i = 1$; for weight, $i = 2$; and for
age, $i = 3$). The $k$ could be used to designate which individual we
are dealing with. Then, $x_{2,7}$ would tell us that we are dealing with
the weight ($i = 2$) of the seventh ($k = 7$) individual. This process
can be carried to any length needed. If the individuals in the
above example were from different groups we could use another
subscript (say $j$) to identify the group. The symbol $x_{ijk}$ would
indicate the $i$th characteristic of the $k$th individual of the $j$th group.

Summations.—To indicate that several (say 6) values of a
variable ($x_i$) are to be added together we could write

$$(x_1 + x_2 + x_3 + x_4 + x_5 + x_6)$$

A slightly shorter way of saying the same thing is

$$(x_1 + x_2 + \ldots + x_6)$$

The three dots (...) indicate that we continue to do the same
thing for all the values from $x_3$ through $x_5$ as we have already
done to $x_1$ and $x_2$.

The same operation can be expressed more compactly by

$$\sum_{i=1}^{6} x_i$$

In words this tells us to sum all values of $x_i$, letting $i$ go from 1
up to 6. The symbol $\Sigma$, which is the Greek letter sigma, indicates
that a summation should be performed. The $x$ tells what is to be
summed and the letter above and below $\Sigma$ indicates the limits over
which the subscript $i$ will be allowed to vary.

If all of the values in a series are to be summed, the range of
summation is frequently omitted from the summation sign giving

$$\sum_i x_i, \quad \sum x_i, \quad \text{or sometimes,} \quad \sum x$$

All of these imply that we would sum all values of $x_i$.

The same principle extends to variables that are identified by
two or more subscripts. A separate summation sign may be used
for each subscript. Thus, we might have

$$\sum_{i=1}^{3} \sum_{j=1}^{4} x_{ij}$$

This would tell us to add up all the values of $x_{ij}$ having $j$ from
1 to 4 and $i$ from 1 to 3. Written the long way, this means

$$(x_{11} + x_{12} + x_{13} + x_{14} + x_{21} + x_{22} + x_{23} + x_{24} + x_{31} + x_{32} + x_{33} + x_{34})$$






As for a single subscript, when all values in a series are to be
summed, the range of summation may be omitted, and sometimes
a single summation symbol suffices. The above summation might
be symbolized by

$$\sum_i \sum_j x_{ij}, \quad \sum_{i,j} x_{ij}, \quad \sum \sum x_{ij}, \quad \text{or maybe even} \quad \sum x_{ij}$$

If a numerical value is substituted for one of the letters in the
subscript, the summation is to be performed by letting the letter
subscript vary but holding the other subscript at the specified
value. As an example,

$$\sum_{j=1}^{4} x_{3j} = (x_{31} + x_{32} + x_{33} + x_{34})$$

and,

$$\sum_{i=1}^{5} x_{i2} = (x_{12} + x_{22} + x_{32} + x_{42} + x_{52})$$


Bracketing.—When other operations are to be performed along
with the addition, some form of bracketing may be used to indicate
the order of operations. For example,

$$\sum_i x_i^2$$

tells us to square each value of $x_i$ and then add up these squared
values. But

$$\left(\sum_i x_i\right)^2$$

tells us to add all the $x_i$ values and then square the sum.
The expression

$$\sum_i \sum_j x_{ij}^2$$

says to square each $x_{ij}$ value and then add the squares. But

$$\sum_i \left(\sum_j x_{ij}\right)^2$$

says that for each value of $i$ we should first add up the $x_{ij}$ over
all values of $j$. Next, this $\left(\sum_j x_{ij}\right)$ is squared and these squared
sums are added up over all values of $i$. If the range of $j$ is from
1 to 4 and the range of $i$ is from 1 to 3, then this means

$$(x_{11} + x_{12} + x_{13} + x_{14})^2 + (x_{21} + x_{22} + x_{23} + x_{24})^2 + (x_{31} + x_{32} + x_{33} + x_{34})^2$$





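The expression

$$\left(\sum_i \sum_j x_{ij}\right)^2$$

would tell us to add up the $x_{ij}$ values over all combinations of $i$
and $j$ and then square the total. Thus,

$$\left(\sum_{i=1}^{3} \sum_{j=1}^{4} x_{ij}\right)^2 = (x_{11} + x_{12} + x_{13} + x_{14} + x_{21} + x_{22} + x_{23} + x_{24} + x_{31} + x_{32} + x_{33} + x_{34})^2$$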


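Where operations involving two or more different variables are
to be performed, the same principles apply.

$$\sum_{i=1}^{3} x_i y_i = x_1 y_1 + x_2 y_2 + x_3 y_3$$

But,

$$\left(\sum_{i=1}^{3} x_i\right)\left(\sum_{i=1}^{3} y_i\right) = (x_1 + x_2 + x_3)(y_1 + y_2 + y_3)$$

N.B.: It is easily seen but often forgotten that

$$\sum x_i^2 \text{ is not usually equal to } \left(\sum x_i\right)^2$$

Similarly,

$$\sum x_i y_i \text{ is not usually equal to } \left(\sum x_i\right)\left(\sum y_i\right)$$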
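
For readers who find a computation easier to follow than a symbol, the summation and bracketing rules can be tried on a small table of made-up numbers. In the sketch below the table `x` has 3 rows (subscript i) and 4 columns (subscript j). All values and variable names are illustrative, and Python counts positions from 0, so `x[2]` is the third row.

```python
# x[i][j]: 3 rows (i = 1..3 in the text) by 4 columns (j = 1..4).
x = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]

# Double summation over all i and j.
total = sum(sum(row) for row in x)

# Hold one subscript fixed and let the other vary.
third_row_sum = sum(x[2])                      # sum over j of x_3j
second_column_sum = sum(row[1] for row in x)   # sum over i of x_i2

# Bracketing changes the order of operations.
sum_of_squares = sum(v ** 2 for row in x for v in row)    # square, then add
sum_of_squared_row_sums = sum(sum(row) ** 2 for row in x)  # add over j, square, add over i
square_of_total = total ** 2                               # add everything, then square

# Two variables: sum of products versus product of sums.
xs = [1, 2, 3]
ys = [4, 5, 6]
sum_of_products = sum(a * b for a, b in zip(xs, ys))
product_of_sums = sum(xs) * sum(ys)

print(total, third_row_sum, second_column_sum)
print(sum_of_squares, sum_of_squared_row_sums, square_of_total)
print(sum_of_products, product_of_sums)
```

The three bracketed quantities differ widely even for this small table, and the sum of products is not the product of the sums.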


Some practice.—If you feel uncomfortable in the presence of
this symbology, try the worked examples on page 79.


Variance 


In a stand of trees, the diameters will usually show some varia-
tion. Some will be larger than the mean diameter, some smaller,
and some fairly close to the mean. Clearly, it would be informa-
tive to know something about this variation. It is not hard to see
that more observations would be needed to get a good estimate of
the mean diameter in a stand where diameters vary from 2 to 30
inches than where the range is from 10 to 12 inches. The measure
of variation most commonly used by statisticians is the variance.

The variance of individuals in a population is a measure of the 
dispersion of individual unit values about their mean. A large 
variance indicates wide dispersion, a small variance indicates little 
dispersion. The variance of individuals is a population character-
istic (a parameter). Very rarely will we know the population
variance. Usually it must be estimated from the sample data. 

For most types of forest measurement data, the estimate of the 
variance from a simple random sample is given by 


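$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{(n - 1)}$$

Where: $s^2$ = Sample estimate of the population variance.
       $y_i$ = The value of the $i$th unit in the sample.
       $\bar{y}$ = The arithmetic mean of the sample, i.e.,

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

       $n$ = The number of units observed in the sample.

Though it may not appear so, computation of the sample vari-
ance is simplified by rewriting the above equation as

$$s^2 = \frac{\sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}}{(n - 1)}$$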


Suppose we have observations on three units with the values 7, 
8, and 12. For this sample our estimate of the variance is 


$$s^2 = \frac{(7^2 + 8^2 + 12^2) - \frac{(27)^2}{3}}{2} = \frac{257 - 243}{2} = 7$$


The standard deviation, a term familiar to the survivors of most
forest mensuration courses, is merely the square root of the vari-
ance. It is symbolized by $s$, and in the above example would be
estimated as $s = \sqrt{7} = 2.6458$.
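
The two forms of the variance equation can be checked against each other on the three observations used above (7, 8, and 12). The function names in this sketch are illustrative only.

```python
import math


def variance_from_deviations(y):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(y)
    y_bar = sum(y) / n
    return sum((value - y_bar) ** 2 for value in y) / (n - 1)


def variance_shortcut(y):
    """Sum of squares minus (sum squared over n), divided by n - 1."""
    n = len(y)
    return (sum(value ** 2 for value in y) - sum(y) ** 2 / n) / (n - 1)


sample = [7, 8, 12]

s_squared = variance_shortcut(sample)
s = math.sqrt(s_squared)

print(f"variance (deviations form): {variance_from_deviations(sample):.4f}")
print(f"variance (shortcut form):   {s_squared:.4f}")
print(f"standard deviation:         {s:.4f}")
```

The shortcut form needs only the running totals of the values and of their squares, which is why it was preferred for desk calculation. With values that are very large relative to their spread it can lose accuracy in floating-point arithmetic, so the deviations form is the safer choice on a computer.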


Standard Errors and Confidence Limits


Like the individual units in a population, sample estimates are 
subject to variation. The mean diameter of a stand as estimated
from a sample of 3 trees will seldom be the same as the estimate
that would have been obtained from other samples of 3 trees. One 
estimate might be close to the mean but a little high. Another 
might be quite a bit high, and the next might be below the mean. 
The estimates vary because different individual units are observed 
in the different samples.

Obviously, it would be desirable to have some indication of how 
much variation might be expected among sample estimates. An 
estimate of mean tree diameter that would ordinarily vary be-
tween 11 and 12 inches would inspire more confidence than one
that might range from 6 to 18 inches. 

The previous section discussed the variance and the standard
deviation (standard deviation = $\sqrt{\text{variance}}$) as measures of the
variation among individuals in a population. Measures of the
same form are used to indicate how a series of estimates might
vary. They are called the variance of the estimate and the
standard error of estimate (standard error of estimate =
$\sqrt{\text{variance of estimate}}$). The term, standard error of estimate, is
usually shortened to standard error when the estimate referred
to is obvious.

The standard error is merely a standard deviation, but among 
estimates rather than among individual units. In fact, if several 
estimates were obtained by repeated sampling of a population, the 
variance and standard error of these estimates could be computed 
from the equations given in the previous section for the variance 
and standard deviation of individuals. But repeated sampling is
unnecessary; the variance and standard error can be obtained
from a single set of sample units. Variability of an estimate de-
pends on the sampling method, the sample size, and the variability
among the individual units in the population, and these are the 
pieces of information needed to compute the variance and stand- 
ard error. For each of the sampling methods described in this 
handbook, the procedure for computing the standard error of 
estimate will be given. 

Computation of a standard error is often regarded as an un- 
necessary frill by some self-styled practical foresters. The fact is, 
however, that a sample estimate is almost worthless without some
indication of its reliability.

Given the standard error, it is possible to establish limits that 
suggest how close we might be to the parameter being estimated. 
These are called confidence limits. For large samples we can take
as a rough guide that, unless a 1-in-3 chance has occurred in
sampling, the parameter will be within one standard error of the
estimated value. Thus, for a sample mean tree diameter of 16
inches with a standard error of 1.5 inches, we can say that the
true mean is somewhere within the limits 14.5 to 17.5 inches. In
making such statements we will, over the long run, be right an
average of two times out of three. One time out of three we will,
because of natural sampling variation, be wrong. The values
given by the sample estimate plus or minus one standard error
are called the 67-percent confidence limits. By spreading the limits 
we can be more confident that they will include the parameter.
Thus, the estimate plus or minus two standard errors will give
limits that will include the parameter unless a 1-in-20 chance has
occurred. These are called the 95-percent confidence limits. The 
99-percent confidence limits are defined by the mean plus or minus 
2.6 standard errors. The 99-percent confidence limits will include 
the parameter unless a 1-in-100 chance has occurred. 

It must be emphasized that this method of computing confidence 
limits will give valid approximations only for large samples. The 
definition of a large sample depends on the population itself, but 
in general any sample of less than 30 observations would not 
qualify. Some techniques of computing confidence limits for small 
samples will be discussed for a few of the sampling methods. 
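
The large-sample limits described above amount to the estimate plus or minus a multiple of the standard error. The sketch below applies the three multipliers given in the text (1, 2, and 2.6) to the example of a 16-inch mean diameter with a standard error of 1.5 inches. The function and variable names are illustrative.

```python
def confidence_limits(estimate, standard_error, multiplier):
    """Return (lower, upper) large-sample confidence limits."""
    half_width = multiplier * standard_error
    return estimate - half_width, estimate + half_width


# Multipliers of the standard error quoted in the text for large samples.
MULTIPLIERS = {67: 1.0, 95: 2.0, 99: 2.6}

ESTIMATE = 16.0        # sample mean tree diameter, inches
STANDARD_ERROR = 1.5   # inches

limits = {
    level: confidence_limits(ESTIMATE, STANDARD_ERROR, multiplier)
    for level, multiplier in MULTIPLIERS.items()
}

for level, (lower, upper) in limits.items():
    print(f"{level}-percent limits: {lower:.1f} to {upper:.1f} inches")
```

As the text cautions, these multipliers are rough guides for large samples only and should not be applied to samples of fewer than about 30 observations.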


Expanded Variances and Standard Errors 


Very often an estimate will be multiplied by a constant to put 
it in a more meaningful form. For example, if a survey has been 
made using one-fifth acre plots and the mean volume per plot 
computed, this estimate would be multiplied by 5 in order to put 
the estimated mean on a per acre basis. Or, for a tract of 800 
acres the mean volume per fifth-acre plot would be multiplied by 
4,000 (the number of one-fifth acres in the tract) in order to 
estimate the total volume. 

Since expanding a variable in this way must also expand its 
variability, it will be necessary to compute a variance and stand- 
ard error for these expanded values. This is easily done. If the 
variable $x$ has variance $s^2$ and this variable is multiplied by a con-
stant (say $k$), the product ($kx$) will have a variance of $k^2 s^2$.

Suppose the estimated mean volume per one-fifth acre plot is
1,400 board feet with a variance of 2,500 board feet (giving a
standard error of $\sqrt{2,500} = 50$ board feet). The mean volume
per acre is

    Mean volume per acre = 5(1,400) = 7,000 board feet

and the variance of this estimate is

    Variance of mean volume per acre = ($5^2$)(2,500) = 62,500.

The standard error of the mean volume per acre would be

    $\sqrt{\text{Variance of mean volume per acre}}$ = 250 board feet


Note that if the standard deviation (or standard error) of г is 
5, then the standard deviation (or standard error) of ka is merely 
ks. So, in the above case, since the standard error of the estimated 
mean volume per fifth-acre plot is 50, the standard error of the 
mean volume per acre is (5) (50) — 250. 

This is a simple but very important rule and anyone who will 
be dealing with sample estimates should master it. 
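
The expansion rule can be checked with a few lines of Python. This is only a sketch using the board-foot figures from the example; the function name `expand` is an illustrative assumption and not part of the handbook.

```python
import math

def expand(estimate, variance, k):
    """Scale an estimate by the constant k; its variance scales by k squared."""
    return k * estimate, (k ** 2) * variance

# Figures from the text: mean of 1,400 board feet per one-fifth-acre plot,
# with a variance of 2,500.
mean_per_acre, var_per_acre = expand(1400, 2500, 5)
se_per_acre = math.sqrt(var_per_acre)

print(mean_per_acre, var_per_acre, se_per_acre)
```

Taking the square root of the expanded variance gives the same standard error as multiplying the original standard error (50) by the constant (5).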

Variables may also be expanded by the addition of a constant. 
Expansion of this type does not affect variability and requires no 
adjustment of the variance or standard errors. Thus if

$$z = x + k$$

where x is a variable and k a constant, then

$$s_z^2 = s_x^2$$


This situation arises where for computational purposes the data 
have been coded by the subtraction of a constant. The variance 
and standard error of the coded values are the same as for the 
uncoded values. Given the three observations 127, 104, and 114 
we could, for ease of computation, code these values by subtract- 
ing 100 from each, to make 27, 4, and 14. The variance of the 
coded values is

$$s^2 = \frac{(27^2 + 4^2 + 14^2) - \dfrac{(45)^2}{3}}{2} = 133$$

which is the same as the variance of the original values

$$s^2 = \frac{(127^2 + 104^2 + 114^2) - \dfrac{(345)^2}{3}}{2} = 133$$

Coefficient of Variation


The coefficient of variation (C) is the ratio of the standard
deviation to the mean. For a sample with a mean¹ of $\bar{x} = 10$ and a
standard deviation of s = 4 we would estimate the coefficient of
variation as

$$C = \frac{s}{\bar{x}} = \frac{4}{10} = 0.4 \text{ or 40 percent}$$



Variance, our measure of variability among units, is often related
to the mean size of the units; large items tend to have a
larger variance than small items. For example, the variance in a 
population of tree heights would be larger than the variance of 
the heights of a population of foresters. The coefficient of varia- 
tion puts the expression of variability on a relative basis. The 
population of tree heights might have a standard deviation of 4.4 
feet while the population of foresters might have a standard de- 
viation of 0.649 foot. In absolute units, the trees are more variable 
than the foresters. But, if the mean tree height is 40 feet and the 
mean height of the foresters is 5.9 feet, the two populations would 
have the same relative variability. They would both have a
coefficient of variation of C = 0.11.

Variance also depends on the measurement units used. The 
standard deviation of foresters’ heights was 0.649 foot. Had the 
heights been measured in inches, the standard deviation would 
have been 12 times as large (if z = 12x, then $s_z = 12 s_x$), or 7.788
inches. But the coefficient of variation would be the same regardless
of the unit of measure. In either case, we would have

$$C = \frac{s}{\bar{x}} = \frac{0.649 \text{ foot}}{5.9 \text{ feet}} = \frac{7.788 \text{ inches}}{70.8 \text{ inches}} = 0.11 \text{ or 11 percent}$$
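
A short Python sketch shows that the unit of measure cancels out of the coefficient of variation. It uses the tree and forester figures quoted above; the function name is an illustrative assumption.

```python
def coefficient_of_variation(std_dev, mean):
    """Ratio of the standard deviation to the mean (unit-free)."""
    return std_dev / mean

cv_feet = coefficient_of_variation(0.649, 5.9)              # foresters, in feet
cv_inches = coefficient_of_variation(0.649 * 12, 5.9 * 12)  # foresters, in inches
cv_trees = coefficient_of_variation(4.4, 40)                # trees, in feet

print(round(cv_feet, 2), round(cv_inches, 2), round(cv_trees, 2))
```

Converting feet to inches multiplies both the standard deviation and the mean by 12, so the ratio is unchanged.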


¹The sample mean of a variable x is frequently symbolized by $\bar{x}$.







In addition to putting variabilities on a comparable basis, the
coefficient of variation simplifies the job of estimating and remembering
the degree of variability of different populations. In
many of the populations with which foresters deal, the coefficient
of variation is approximately 100 percent. Because it is often
possible to guess at the size of the population mean, we can readily
estimate the standard deviation. Such information is useful in
planning a sample survey.


Covariance 


In some sampling methods measurements are made on two or 
more characteristics for each sample unit. In measuring forage 
production, for example, we might get the green weight of the 
grass clipped to a height of 1 inch from a circular plot 1 foot in 
diameter. Later we might get the ovendry weight of the same 
sample. 

Covariance is a measure of how two variables vary in relation- 
ship to each other (covariability). Suppose the two variables are 
labeled y and x. If the larger values of y tend to be associated
with the larger values of x, the covariance will be positive. If the
larger values of y are associated with the smaller values of x, the
covariance will be negative. When there is no particular associa- 
tion of y and x values, the covariance approaches zero. Like 
the variance, the covariance is a population characteristic—a 
parameter. 

For simple random samples, the formula for the estimated
covariance ($s_{xy}$) of x and y is

$$s_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{n - 1}$$

Computation of the sample covariance is simplified by rewriting
the formula

$$s_{xy} = \frac{\sum xy - \dfrac{(\sum x)(\sum y)}{n}}{n - 1}$$





Suppose that a sample of n = 6 units has produced the following
x and y values:

    i    1    2    3    4    5    6   Totals
    y    2   12    7   14   11    8     54
    x   12    4   10    3    6    7     42








Then,

$$s_{xy} = \frac{(2)(12) + (12)(4) + \cdots + (8)(7) - \dfrac{(54)(42)}{6}}{6 - 1} = \frac{306 - 378}{5} = -14.4$$


The negative value indicates that the larger values of y tend
to be associated with the smaller values of x.
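
The covariance of the six sample units can be computed both ways in Python. This is a sketch; the third x value is taken as 10, the value consistent with the printed total of 42, and the variable names are illustrative.

```python
x = [12, 4, 10, 3, 6, 7]
y = [2, 12, 7, 14, 11, 8]
n = len(x)

# Definition: sum of cross-products of deviations over (n - 1).
x_bar = sum(x) / n
y_bar = sum(y) / n
cov_definition = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

# Computing form: uses only the raw totals.
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
cov_shortcut = (sum_xy - sum(x) * sum(y) / n) / (n - 1)

print(sum_xy, cov_definition, cov_shortcut)
```

The two forms agree; the computing form is simply easier to work by hand because it needs no deviations from the means.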


Correlation Coefficient 


The magnitude of the covariance, like that of the variance, is
often related to the size of the unit values. Units with large values
of x and y tend to have larger covariance values than units with
smaller x and y values. A measure of the degree of linear association
between two variables that is unaffected by the size of the
unit values is the simple correlation coefficient. A sample-based
estimate (r) of the correlation coefficient is

$$r = \frac{\text{Covariance of } x \text{ and } y}{\sqrt{(\text{Variance of } x)(\text{Variance of } y)}} = \frac{s_{xy}}{\sqrt{(s_x^2)(s_y^2)}}$$


The correlation coefficient can vary between —1 and +1. As in 
covariance, a positive value indicates that the larger values of y 
tend to be associated with the larger values of x. A negative value
indicates an association of the larger values of y with the smaller
values of x. A value close to +1 or -1 indicates a strong linear
association between the two variables. Correlations close to zero 
suggest that there is little or no linear association. 

For the data given in the discussion of covariance we found
$s_{xy} = -14.4$. For the same data, the sample variance of x is
$s_x^2 = 12.0$, and the sample variance of y is $s_y^2 = 18.4$. Then the
estimate of the correlation between y and x is

$$r = \frac{-14.4}{\sqrt{(12.0)(18.4)}} = \frac{-14.4}{14.86} = -0.969$$

















The negative value indicates that as x increases y decreases, while 
the nearness of r to —1 indicates that the linear association is very 
close. 

An important thing to remember about the correlation coefficient
is that it is a measure of the linear association between two
variables. A value of r close to zero does not necessarily mean
that there is no relationship between the two variables. It merely
means that there is not a good linear (straight-line) relationship.
There might actually be a strong nonlinear relationship.

It must also be remembered that the correlation coefficient computed
from a set of sample data is an estimate, just as the sample
mean is an estimate. Like the sample mean, the reliability of a
correlation coefficient increases with the sample size. Most statistics
books have tables that help in judging the reliability of a sample
correlation coefficient.
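
Both points about the correlation coefficient can be illustrated in Python. The sketch below computes r for the six-unit sample from the covariance discussion (with the third x value taken as 10, consistent with the printed total of 42), and then for a hypothetical data set, not from the handbook, in which y is exactly the square of x. The function name `correlation` is an illustrative assumption.

```python
import math

def correlation(x, y):
    """Sample correlation r = s_xy / sqrt(s_x^2 * s_y^2), with (n - 1) divisors."""
    n = len(x)
    sxy = (sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n) / (n - 1)
    sxx = (sum(a * a for a in x) - sum(x) ** 2 / n) / (n - 1)
    syy = (sum(b * b for b in y) - sum(y) ** 2 / n) / (n - 1)
    return sxy / math.sqrt(sxx * syy)

# Data from the covariance example.
r_example = correlation([12, 4, 10, 3, 6, 7], [2, 12, 7, 14, 11, 8])

# Hypothetical data with an exact but nonlinear relationship (y = x squared).
xs = [-3, -2, -1, 0, 1, 2, 3]
r_curve = correlation(xs, [v * v for v in xs])

print(round(r_example, 3), round(r_curve, 3))
```

In the second data set y is completely determined by x, yet r is zero, because the relationship is a symmetric curve rather than a straight line.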


Independence 

When no relationship exists between two variables they are said 
to be independent; the value of one variable tells us absolutely 
nothing about the value of the other. The common measures of 
independence (or lack of it) are the covariance and the correla- 
tion coefficient. As previously noted, when there is little or no 
association between the values of two variables, their covariance 
and correlation approach zero (but keep in mind that the converse
is not necessarily true; a zero correlation does not prove that 
there is no association but only indicates that there is no strong 
linear relationship). 

Completely independent variables are rare in biological populations,
but many variables are very weakly related and may be
regarded as independent. As an example, the annual height growth
of pole-sized loblolly pine dominants is relatively independent of
the stand basal area within fairly broad limits (say 50 to 120
square feet per acre). There is also considerable evidence that
periodic cubic volume growth of loblolly pine is poorly associated
with (i.e., almost independent of) stand basal area over a fairly
wide range.

The concept of independence is also applied to sample estimates. 
In this case, however, the independence (or lack of it) may be 
due to the sampling method as well as to the relationship between 
the basic variables. For discussion purposes, two situations may 
be recognized: 


1. Two estimates have been made of the same parameter.
2. Estimates have been made of two different parameters.


In the first situation, the degree of independence depends en- 
tirely on the method of sampling. Suppose that two completely 
separate surveys have been made to estimate the mean volume per 
acre of a timber stand. Because different sample plots are in- 
volved, the estimates of mean volume obtained from these surveys 
would be regarded as statistically independent. But suppose an 
estimate has been made from one survey and then additional 
sample plots are selected and a second estimate is made using the 
plot data from both the first and second surveys. Since some of 
the same observations enter both estimates, the estimates would 
not be independent. In general, two estimates of a single param- 
eter are not independent if some of the same observations are used 
in both. The degree of association will depend on the proportions 
of observations common to the two estimates. 

In the second situation (estimates of two different parameters)
the degree of independence may depend on both the sampling 
method and the degree of association between the basic variables. 
If mean height and mean diameter of a population of trees were 
estimated by randomly selecting a number of individual trees and 
measuring both the height and diameter of each tree, the two 
estimates would not be independent. The relationship between
the two estimates (usually measured by their covariance or
correlation) would, in this case, depend on the degree of association
between the height and diameter of individual trees. On the other
hand, if one set of trees were used to estimate mean height and 
another set were selected for estimating mean diameter, the two 
estimates would be statistically independent even though height 
and diameter are not independent when measured on the same tree. 

A measure of the degree of association (covariance) between 
two sample estimates is essential in the evaluation of the sampling 
error for several types of surveys. For the sampling methods
described in this handbook, the procedure for computing the
covariance of two estimates will be given when needed.


Variances of Products, Ratios, and Sums 


In a previous section, we learned that if a quantity is estimated
as the product of a constant and a variable (say Q = kz, where
k is a constant and z is a variable) the variance of Q will be
$s_Q^2 = k^2 s_z^2$. Thus, if we wish to estimate the total volume of a
stand, we would multiply the estimated mean per unit ($\bar{y}$, a
variable) by the total number of units (N, a constant) in the
population. The variance of the estimated total will be $N^2 s_{\bar{y}}^2$. Its
standard deviation (or standard error) would be the square root of its
variance or $N s_{\bar{y}}$.

The variance of a product.—In some situations the quantity in
which we are interested will be estimated as the product of two
variables and a constant. Thus,

$$Q_1 = kzw$$

where: k = a constant and
       z and w = variables having variances $s_z^2$ and $s_w^2$ and
                 covariance $s_{zw}$.

For large samples, the variance of $Q_1$ is estimated by

$$s_{Q_1}^2 = Q_1^2 \left( \frac{s_z^2}{z^2} + \frac{s_w^2}{w^2} + \frac{2 s_{zw}}{zw} \right)$$

As an example of such estimates, consider a large forest survey
project which uses a dot count on aerial photographs to estimate
the proportion of an area that is in forest ($\bar{p}$), and a ground
cruise to estimate the mean volume per acre ($\bar{v}$) of forested land.
To estimate the forested acreage, the total acreage (N) in the
area is multiplied by the estimated proportion forested. This in
turn is multiplied by the mean volume per forested acre to give
the total volume. In formula form

$$\text{Total volume} = N(\bar{p})(\bar{v})$$

Where: N = The total acreage of the area (a known constant).
       $\bar{p}$ = The estimated proportion of the area that is forested.
       $\bar{v}$ = The estimated mean volume per forested acre.

The variance of the estimated total volume would be

$$s^2 = \left[ N(\bar{p})(\bar{v}) \right]^2 \left( \frac{s_{\bar{p}}^2}{\bar{p}^2} + \frac{s_{\bar{v}}^2}{\bar{v}^2} + \frac{2 s_{\bar{p}\bar{v}}}{\bar{p}\bar{v}} \right)$$

If the two estimates are made from separate surveys, they are
assumed to be independent and the covariance set equal to zero.
This would be the situation here where $\bar{p}$ is estimated from a
photo dot count and $\bar{v}$ from an independently selected set of
ground locations. With the covariance set equal to zero, the
variance formula would be

$$s^2 = \left[ N(\bar{p})(\bar{v}) \right]^2 \left( \frac{s_{\bar{p}}^2}{\bar{p}^2} + \frac{s_{\bar{v}}^2}{\bar{v}^2} \right)$$

Variance of a ratio.—In other situations, the quantity we are
interested in will be estimated as the ratio of two estimates
multiplied by a constant. Thus, we might have

$$Q_2 = k\,\frac{z}{w}$$

For large samples, the variance of $Q_2$ can be approximated by

$$s_{Q_2}^2 = Q_2^2 \left( \frac{s_z^2}{z^2} + \frac{s_w^2}{w^2} - \frac{2 s_{zw}}{zw} \right)$$

This formula comes into use with the ratio-of-means estimator
described in the section on regression estimators.
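
The large-sample formulas for a product and a ratio differ only in the sign of the covariance term, which a short Python sketch makes plain. The function names and all of the survey figures below are hypothetical, chosen only to illustrate the arithmetic; they do not come from the handbook.

```python
def product_variance(k, z, w, var_z, var_w, cov_zw=0.0):
    """Large-sample variance of Q1 = k*z*w."""
    q = k * z * w
    return q ** 2 * (var_z / z ** 2 + var_w / w ** 2 + 2 * cov_zw / (z * w))

def ratio_variance(k, z, w, var_z, var_w, cov_zw=0.0):
    """Large-sample variance of Q2 = k*z/w."""
    q = k * z / w
    return q ** 2 * (var_z / z ** 2 + var_w / w ** 2 - 2 * cov_zw / (z * w))

# Hypothetical survey figures (assumptions, not from the handbook):
N = 10_000                      # total acres in the area, a known constant
p, var_p = 0.60, 0.0004         # proportion forested and its variance
v, var_v = 2_000.0, 10_000.0    # mean volume per forested acre and its variance

total_volume = N * p * v
# Separate surveys, so the covariance is left at its default of zero.
var_total = product_variance(N, p, v, var_p, var_v)
se_total = var_total ** 0.5

print(total_volume, round(se_total))
```

With a covariance of zero the two functions return the same relative variance; a positive covariance inflates the variance of a product and reduces that of a ratio.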


Variance of a sum.—Sometimes we might wish to use the sum
of two or more variables as an estimate of some quantity. With
two variables we might have

$$Q_3 = k_1 x_1 + k_2 x_2$$

where: $k_1$ and $k_2$ = constants
       $x_1$ and $x_2$ = variables having variance $s_1^2$ and $s_2^2$
                     and covariance $s_{12}$

The variance of this estimate is

$$s_{Q_3}^2 = k_1^2 s_1^2 + k_2^2 s_2^2 + 2 k_1 k_2 s_{12}$$

If we measure the volume of sawtimber (x) and the volume of
poletimber (y) on the same plots (and in the same units of measure)
and find the mean volumes to be $\bar{x}$ and $\bar{y}$, with variances
$s_{\bar{x}}^2$ and $s_{\bar{y}}^2$ and covariance $s_{\bar{x}\bar{y}}$, then the mean total volume in
pole-size and larger trees would be

$$\text{Mean total volume} = \bar{x} + \bar{y}$$

The variance of this estimate is

$$s^2 = s_{\bar{x}}^2 + s_{\bar{y}}^2 + 2 s_{\bar{x}\bar{y}}$$




The same result would, of course, be obtained by totaling the x
and y values for each plot and then computing the variance of 
the totals. 

This formula is also of use where a weighted mean is to be com- 
puted. For example, we might have made sample surveys of two 
tracts of timber. 

Tract 1
    Size = 3,200 acres
    Estimated mean volume per acre = 4,800 board feet
    Variance of the mean = 112,500 board feet
Tract 2
    Size = 1,200 acres
    Estimated mean volume per acre = 7,400 board feet
    Variance of the mean = 124,000 board feet

In combining these two means to estimate the overall mean volume
per acre we might want to weight each mean by the tract size
before adding and then divide the sum of the weighted means by
the sum of the weights (this is the same as estimating the total
volume on both tracts and dividing by the total acreage to get the
mean volume per acre). Thus,

$$\bar{x} = \frac{3200(4800) + 1200(7400)}{(3200 + 1200)} = \left(\frac{3200}{4400}\right)(4800) + \left(\frac{1200}{4400}\right)(7400) = 5509$$

Because the two tract means were obtained from independent
samples, the covariance between the two estimates is zero, and the
variance of the combined estimate would be

$$s_{\bar{x}}^2 = \left(\frac{3200}{4400}\right)^2 (112{,}500) + \left(\frac{1200}{4400}\right)^2 (124{,}000) = \frac{(3200)^2 (112{,}500) + (1200)^2 (124{,}000)}{(4400)^2} = 68{,}727.$$
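
The weighted mean and its variance can be reproduced with a few lines of Python. This is a sketch using the two tracts described above, with tract 1 taken as 3,200 acres as in the computation; the variable names are illustrative.

```python
tracts = [
    # (acres, mean volume per acre, variance of the mean)
    (3200, 4800, 112_500),
    (1200, 7400, 124_000),
]

total_acres = sum(acres for acres, _, _ in tracts)
weights = [acres / total_acres for acres, _, _ in tracts]

# Weighted mean: sum of k_i * x_i with k_i = tract share of the acreage.
overall_mean = sum(k * mean for k, (_, mean, _) in zip(weights, tracts))

# Independent samples, so the covariance terms drop out.
overall_var = sum(k ** 2 * var for k, (_, _, var) in zip(weights, tracts))

print(round(overall_mean), round(overall_var))
```

The weights play the part of the constants in the variance-of-a-sum formula, which is why each one is squared before multiplying its variance.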

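The general rule for the variance of a sum is if

$$Q = k_1 x_1 + k_2 x_2 + \cdots + k_n x_n$$

where: $k_i$ = constants
       $x_i$ = variables with variances $s_i^2$ and covariance $s_{ij}$,

then

$$s_Q^2 = k_1^2 s_1^2 + k_2^2 s_2^2 + \cdots + k_n^2 s_n^2 + 2 k_1 k_2 s_{12} + 2 k_1 k_3 s_{13} + \cdots + 2 k_{n-1} k_n s_{(n-1)n}$$


Transformation of Variables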


Many of the procedures described in this handbook imply cer- 
tain assumptions about the nature of the variable being studied. 










When a variable does not fit the assumptions for a particular pro- 
cedure some other method must be used or the variable must be 
changed (transformed). 

One of the common assumptions is that variability is inde- 
pendent of the mean. Some variables (e.g. those that follow a 
binomial or Poisson distribution) tend to have a variance that is 
in some way related to the mean—populations with large means 
often having large variance. In order to use procedures that as- 
sume that there is no relationship between the variance and the 
mean, these variables are frequently transformed. The transformation,
if properly selected, puts the original data on a scale in
which its variability is independent of the mean. Some common 
transformations are the square root, arcsin, and logarithm. The 
arcsin transformation is illustrated on page 66. 

If a method assumes that there is a linear relationship between 
two variables, it is often necessary to transform one or both of 
the variables so that they satisfy this assumption. A variable may 
also be transformed to convert its distribution to the normal on
which many of the simpler statistical methods are based. 

The amateur sampler will do well to seek expert advice when 
transformations are being considered. 

Finally, it should be noted that transformation is not synonymous
with coding, which is done to simplify computation. Nor is
it a form of mathematical hocus-pocus aimed at obtaining answers
that are in agreement with preconceived notions.


SAMPLING METHODS FOR CONTINUOUS VARIABLES 


Simple Random Sampling 


All of the sampling methods to be described in this handbook
have their roots in simple random sampling. Because it is basic, 
the method will be discussed in greater detail than any of the 
other procedures. 

The fundamental idea in simple random sampling is that, in 
choosing a sample of n units, every possible combination of n units 
should have an equal chance of being selected. This is not the
same as requiring that every unit in the population have an equal 
chance of being selected. The latter requirement is met by many 
forms of restricted randomization and even by some systematic 
selection methods. 

Giving every possible combination of n units an equal chance
of appearing in a sample of size n may be difficult to visualize but
is easily accomplished. It is only necessary to be sure that at any
stage of the sampling the selection of a particular unit is in no
way influenced by the other units that have been selected. To state
it in another way, the selection of any given unit should be
completely independent of the selection of all other units. One way
to do this is to assign every unit in the population a number and
then draw n numbers from a table of random digits (table 1,
p. 82). Or, the numbers can be written on some equal-sized disks
or slips of paper which are placed in a bowl, thoroughly mixed,
and then drawn one at a time. For units such as individual tree
seeds, the units themselves may be drawn at random.

The units may be selected with or without replacement. If selec- 
tion is with replacement, each unit is allowed to appear in the 
sample as often as it is selected. In sampling without replacement,
a particular unit is allowed to appear in the sample only once.
Most forest sampling is without replacement. As will be shown 
later, the procedure for computing standard errors depends on 
whether sampling was with or without replacement. 

Sample selection.—The selection method and computations may 
be illustrated by the sampling of a 250-acre plantation. The objec- 
tive of the survey was to estimate the mean cordwood volume per 
acre in trees more than 5 inches d.b.h. outside bark. The popula- 
tion and sample units were defined to be square quarter-acre plots 
with the unit value being the plot volume. The sample was to con- 
sist of 25 units selected at random and without replacement. 

The quarter-acre units were plotted on a map of the plantation 
and assigned numbers from 1 to 1,000. From a table of random
digits, 25 three-digit numbers were selected to identify the units 
to be included in the sample (the number 000 was associated with 
the plot numbered 1,000). No unit was counted in the sample more 
than once. Units drawn a second time were rejected and an alter- 
native unit was randomly selected. 
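
Where a computer is at hand, the same kind of selection can be made with a pseudorandom number generator in place of a table of random digits. The Python sketch below is one way to do it, not the procedure used in the survey; the seed value and variable names are arbitrary assumptions.

```python
import random

N_UNITS = 1000     # quarter-acre units on the 250-acre plantation
SAMPLE_SIZE = 25

# The seed is arbitrary; it is fixed here only so the draw can be repeated.
rng = random.Random(232)

# random.sample draws without replacement, so no unit can appear twice.
selected_units = sorted(rng.sample(range(1, N_UNITS + 1), SAMPLE_SIZE))

print(selected_units)
```

Because `random.sample` never returns the same unit twice, there is no need to reject duplicates and draw alternatives as must be done with a table of digits.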

The cordwood volumes measured on the 25 units were as
follows:

     7   10    7    4    7
     8    8    8    7    5
     2    6    9    7    8
     6    7   11    8    8
     7    3    8    7    7

                    Total = 175


Estimates.—If the cordwood volume on the i-th sampling unit is
designated $y_i$, the estimated mean volume ($\bar{y}$) per sampling unit is

$$\bar{y} = \frac{\sum y_i}{n} = \frac{7 + 8 + 2 + \cdots + 7}{25} = \frac{175}{25} = 7 \text{ cords per quarter-acre plot.}$$

The mean volume per acre would, of course, be 4 times the mean
volume per quarter-acre plot, or 28 cords.

As there is a total of N = 1,000 quarter-acre units in the 250-acre
plantation, the estimated total volume ($\hat{Y}$) in the plantation
would be

$$\hat{Y} = N\bar{y} = (1{,}000)(7) = 7{,}000 \text{ cords.}$$

Alternatively,

$$\hat{Y} = (28 \text{ cords per acre})(250 \text{ acres}) = 7{,}000 \text{ cords.}$$













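Standard errors.—A first step in computing the standard error
of estimate is to make an estimate ($s_y^2$) of the variance of
individual values of y.

$$s_y^2 = \frac{\sum y_i^2 - \dfrac{(\sum y_i)^2}{n}}{n - 1}$$

In this example,

$$s_y^2 = \frac{(7^2 + 8^2 + \cdots + 7^2) - \dfrac{(175)^2}{25}}{(25 - 1)} = \frac{1{,}317 - 1{,}225}{24} = 3.8333 \text{ cords}$$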


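When sampling is without replacement the standard error of
the mean ($s_{\bar{y}}$) for a simple random sample is

$$s_{\bar{y}} = \sqrt{\frac{s_y^2}{n}\left(1 - \frac{n}{N}\right)}$$

where: N = total number of sample units in the entire population,
       n = number of units in the sample.

For the plantation survey,

$$s_{\bar{y}} = \sqrt{\frac{3.8333}{25}\left(1 - \frac{25}{1{,}000}\right)} = \sqrt{(0.15333)(0.975)} = 0.387 \text{ cord}$$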


This is the standard error for the mean per quarter-acre plot. By
the rules for the expansion of variances and standard errors, the
standard error for the mean volume per acre will be (4)(0.387)
= 1.548 cords.

Similarly, the standard error for the estimated total volume
($s_{\hat{Y}}$) will be

$$s_{\hat{Y}} = N s_{\bar{y}} = (1{,}000)(0.387) = 387 \text{ cords.}$$
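
The whole chain of computations for the plantation survey, from plot volumes to standard errors, can be sketched in Python as follows. The 25 plot volumes are those listed for the sample; the variable names are illustrative.

```python
import math

volumes = [7, 10, 7, 4, 7,
           8, 8, 8, 7, 5,
           2, 6, 9, 7, 8,
           6, 7, 11, 8, 8,
           7, 3, 8, 7, 7]    # cords per quarter-acre plot
N = 1000                      # units in the population
n = len(volumes)

mean = sum(volumes) / n
variance = (sum(v * v for v in volumes) - sum(volumes) ** 2 / n) / (n - 1)

fpc = 1 - n / N                            # finite population correction
se_mean = math.sqrt(variance / n * fpc)    # per quarter-acre plot
se_per_acre = 4 * se_mean
se_total = N * se_mean

print(mean, round(variance, 4), round(se_mean, 3), round(se_per_acre, 3), round(se_total))
```

Carried at full precision, the per-acre standard error rounds to 1.547 rather than 1.548; the small difference arises because the text rounds the plot standard error to 0.387 before multiplying by 4.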


Sampling with replacement.—In the formula for the standard
error of the mean, the term $\left(1 - \frac{n}{N}\right)$ is known as the finite
population correction or fpc. It is used when units are selected
without replacement. If units are selected with replacement, the fpc
is omitted and the formula for the standard error of the mean
becomes

$$s_{\bar{y}} = \sqrt{\frac{s_y^2}{n}}$$

Even when sampling is without replacement the sampling fraction
(n/N) may be extremely small, making the fpc very close to
unity. If n/N is less than 0.05, the fpc is commonly ignored and
the standard error computed from the shortened formula.

Confidence limits for large samples.—By itself, the estimated
mean of 28 cords per acre does not tell us very much. Had the
sample consisted of only 2 observations we might conceivably have
drawn the quarter-acre plots having only 2 and 3 cords, and the
estimated mean would be 10 cords per acre. Or if we had selected
the plots with 10 and 11 cords, the mean would be 42 cords per
acre.

To make an estimate meaningful it is necessary to compute
confidence limits that indicate the range within which we might
expect (with some specified degree of confidence) to find the
parameter. As was discussed in the chapter on standard errors,
the 95-percent confidence limits for large samples are given by

$$\text{Estimate} \pm 2(\text{Standard Error of Estimate})$$

Thus the mean volume per acre (28 cords) that had a standard
error of 1.548 cords would have confidence limits of

$$28 \pm 2(1.548) = 24.90 \text{ to } 31.10 \text{ cords per acre.}$$

And the total volume of 7,000 cords that had a standard error of
387 cords would have 95-percent confidence limits of

$$7{,}000 \pm 2(387) = 6{,}226 \text{ to } 7{,}774 \text{ cords.}$$

Unless a 1-in-20 chance has occurred in sampling, the population
mean volume per acre is somewhere between 24.9 and 31.1
cords, and the true total volume is between 6,226 and 7,774 cords.

Because of sampling variation, the 95-percent confidence limits 
will, on the average, fail to include the parameter in 1 case out 
of 20. It must be emphasized, however, that these limits and the 
confidence statement take account of sampling variation only. 
They assume that the plot values are without measurement error 
and that the sampling and estimating procedures are unbiased 
and free of computational mistakes. If these basic assumptions are 
not valid, the estimates and confidence statements may be nothing 
more than a statistical hoax. 

Confidence limits for small samples.—Ordinarily, large-sample 
confidence limits are not appropriate for samples of less than 30
observations. For smaller samples the proper procedure depends 
on the distribution of the unit values in the parent population, a 
subject that is beyond the scope of this handbook. Fortunately, 
many forest measurements follow the bell-shaped normal distribu- 
tion, or a distribution that can be made nearly normal by trans- 
formation of the variable. 













For samples of any size from normally distributed populations,
Student's t value can be used to compute confidence limits. The
general formula is

$$\text{Estimate} \pm (t)(\text{Standard Error of Estimate}).$$

The values of t have been tabulated (table 2, page 86). The
particular value of t to be used depends on the degree of confidence
desired and on the size of the sample. For 95-percent confidence
limits, the t values are taken from the column for a probability
of .05. For 99-percent confidence limits, the t value would come
from the .01 probability column. Within the specified columns, the
appropriate t for a simple random sample of n observations is
found in the row for (n - 1) df's (degrees of freedom²). For a
simple random sample of 25 observations the t value for computing
the 95-percent confidence limits will be found in the .05 column
and the 24 df row. This value is 2.064. Thus, for the plantation
survey that showed a mean per-acre volume of 28 cords and a
standard error of the mean of 1.548 cords, the small-sample
95-percent confidence limits would be

$$28 \pm (2.064)(1.548) = 24.80 \text{ to } 31.20 \text{ cords}$$

The same t value is used for computing the 95-percent confidence
limits on the total volume. As the estimated total was 7,000 cords
with a standard error of 387 cords, the 95-percent confidence
limits are

$$7{,}000 \pm (2.064)(387) = 6{,}201 \text{ to } 7{,}799 \text{ cords.}$$
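
The large-sample and small-sample limits differ only in the multiplier applied to the standard error, as the following Python sketch shows. The t value of 2.064 is the tabular value quoted in the text; the function and constant names are illustrative assumptions.

```python
def confidence_limits(estimate, std_error, multiplier):
    """Return (lower, upper) = estimate -/+ multiplier * standard error."""
    half_width = multiplier * std_error
    return estimate - half_width, estimate + half_width

T_05_24DF = 2.064   # tabular t quoted in the text for .05 probability, 24 df

large_sample = confidence_limits(28, 1.548, 2)
small_sample = confidence_limits(28, 1.548, T_05_24DF)
total_limits = confidence_limits(7000, 387, T_05_24DF)

print([round(v, 2) for v in large_sample])
print([round(v, 2) for v in small_sample])
print([round(v) for v in total_limits])
```

With 24 degrees of freedom the t multiplier is only slightly larger than 2, so the two sets of limits are close; with fewer observations the gap widens.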


Size of sample.—In the example illustrating simple random
sampling, 25 units were selected. But why 25? Why not 100? Or
10? All too often the number depends on the sampler's view of
what looks about right. But there is a somewhat more objective
solution. That is to take only the number of observations needed
to give the desired precision.

In planning the plantation survey, we could have stated that
unless a 1-in-20 chance occurs we would like our sample estimate
of the mean to be within ± E cords of the population mean. As
the small-sample confidence limits are computed as $\bar{y} \pm t(s_{\bar{y}})$, this
is equivalent to saying that we want

$$t(s_{\bar{y}}) = E$$

For a simple random sample

$$s_{\bar{y}} = \sqrt{\frac{s_y^2}{n}\left(1 - \frac{n}{N}\right)}$$


²In this handbook the expression "degrees of freedom" refers to a parameter
in the distribution of Student's t. When a tabular value of t is required, the
number of degrees of freedom (df's) must be specified. The expression is not
easily explained in nonstatistical language. One definition is that the df's are
equal to the number of observations in a sample minus the number of
independently estimated parameters used in calculating the sample variance. Thus,
in a simple random sample of n observations the only estimated parameter
needed in calculating the sample variance is the mean ($\bar{y}$), and so the df's
would be (n - 1).





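Substituting for $s_{\bar{y}}$ in the first equation we get

$$t\sqrt{\frac{s_y^2}{n}\left(1 - \frac{n}{N}\right)} = E$$

Rewritten in terms of the sample size (n) this becomes

$$n = \frac{1}{\dfrac{E^2}{t^2 s_y^2} + \dfrac{1}{N}}$$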
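
The algebra behind this rewriting is short. As a sketch, with $s_y^2$ the variance among units, n the sample size, N the number of units in the population, t the tabular value of Student's t, and E the allowable error, squaring both sides of the requirement that t times the standard error equal E and then solving for n gives:

$$
\begin{aligned}
t^2\,\frac{s_y^2}{n}\left(1 - \frac{n}{N}\right) &= E^2 \\
\frac{1}{n} - \frac{1}{N} &= \frac{E^2}{t^2 s_y^2} \\
n &= \frac{1}{\dfrac{E^2}{t^2 s_y^2} + \dfrac{1}{N}}
\end{aligned}
$$

The second line follows from dividing both sides by $t^2 s_y^2$ and expanding the product of 1/n and the finite population correction.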


To solve this relationship for n, we must have some estimate
($s_y^2$) of the population variance. Sometimes the information is
available from previous surveys. In the illustration, we found
$s_y^2 = 3.83$, a value which might be taken as representative of the
variation among quarter-acre plots in this or similar populations.
In the absence of this information, a small preliminary survey
might be made in order to obtain an estimate of the variance.
When, as often happens, neither of these solutions is feasible, a
very crude estimate can be made from the relationship

$$s^2 = \left(\frac{R}{4}\right)^2$$

where: R = estimated range from the smallest to the largest unit
           value likely to be encountered in sampling.

For the plantation survey we might estimate the smallest y-value
on quarter-acre plots to be 1 cord and the largest to be 10 cords.
As the range is 9, the estimated variance would be

$$s^2 = \left(\frac{9}{4}\right)^2 = 5.06$$

This approximation procedure should be used only when no other
estimate of the variance is available.

Having specified a value of E and obtained an estimate of the
variance, the last piece of information we need is the value of t.
Here we hit somewhat of a snag. To use t we must know the
number of degrees of freedom. But, the number of df's must be
(n - 1) and n is not known and cannot be determined without
knowing t.

An iterative solution will give us what we need, and it is not as
difficult as it sounds. The procedure is to guess at a value of n,
use the guessed value to get the degrees of freedom for t, and
then substitute the appropriate t value in the sample-size formula
to solve for a first approximation of n. Selecting a new n somewhere
between the guessed value and the first approximation, but
closer to the latter, we compute a second approximation. The process
is repeated until successive values of n are the same or only
slightly different. Three trials usually suffice.

To illustrate the process, suppose that in planning the plantation
survey we had specified that, barring a 1-in-100 chance, we
would like the estimate to be within 3.0 cords of the true mean
volume per acre. This is equivalent to E = 0.75 cord per quarter-acre.
From previous experience, we estimate the population variance
among quarter-acre plots to be $s_y^2 = 4$, and we know that
there is a total of N = 1,000 units in the population. To solve for
n, this information is substituted in the sample-size formula given
above.

$$n = \frac{1}{\dfrac{(0.75)^2}{(t^2)(4)} + \dfrac{1}{1{,}000}}$$

We will have to use the t value for the .01 probability level, but
we do not know how many degrees of freedom t will have without
knowing n. As a first guess, we can try n = 61; then the value
of t with 60 degrees of freedom at the .01 probability level is
t = 2.66. Thus, the first approximation will be

$$n_1 = \frac{1}{\dfrac{(0.75)^2}{(2.66^2)(4)} + \dfrac{1}{1{,}000}} = \frac{1}{\dfrac{0.5625}{(7.0756)(4)} + \dfrac{1}{1{,}000}} = 47.9$$

A second guessed value for n would be somewhere between 61 and
48, but closer to the computed value. We might test n = 51, for
which the value of t (50 df's) at the .01 level is about 2.68, whence

$$n_2 = \frac{1}{\dfrac{0.5625}{(7.1824)(4)} + \dfrac{1}{1{,}000}} = 48.6$$

The desired value is somewhere between 51 and 48.6 but much
closer to the latter. Because the estimated sample size is, at best,
only a good approximation, it is rather futile to strain on the
computation of n. In this case we would probably settle on n = 50,
a value that could have been easily guessed after the first
approximation was computed.





If the sampling fraction (n/N) is likely to be small (say, less than
0.05), the finite-population correction (1 − n/N) may be ignored
in the estimation of sample size and the formula simplified to

$$n = \frac{t^2 s^2}{E^2}$$

This formula is also appropriate in sampling with replacement.
In the previous example the simplified formula gives an estimated
sample size of n = 51.

The short formula is frequently used to get a first approximation
of n. Then, if the sample size indicated by the short formula
is a considerable proportion (say over 10 percent) of the number
of units in the population and sampling will be without replacement,
the estimated sample size is recomputed with the long
formula.
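
The trial-and-error arithmetic of the plantation example can be sketched in a few lines of Python. This is only an illustration: the function name `sample_size` and its arguments are hypothetical, and the t values are simply the table values quoted in the text (2.66 for 60 degrees of freedom and 2.68 for 50 degrees of freedom at the .01 level).

```python
def sample_size(t, variance, allowable_error, population_size=None):
    """Long formula: n = 1 / (E^2 / (t^2 s^2) + 1/N).

    When population_size is None the 1/N term is dropped, which gives
    the short formula n = t^2 s^2 / E^2.
    """
    inverse_n = allowable_error ** 2 / (t ** 2 * variance)
    if population_size is not None:
        inverse_n += 1.0 / population_size
    return 1.0 / inverse_n


# First trial: guess n = 61, so t has 60 degrees of freedom (t = 2.66).
first = sample_size(t=2.66, variance=4, allowable_error=0.75, population_size=1000)

# Second trial: guess n = 51, so t has 50 degrees of freedom (t = 2.68).
second = sample_size(t=2.68, variance=4, allowable_error=0.75, population_size=1000)

# Short formula, ignoring the finite-population correction.
short = sample_size(t=2.68, variance=4, allowable_error=0.75)

print(round(first, 1), round(second, 1), round(short))
```

Each new trial only needs a fresh t value from the table for the degrees of freedom implied by the latest guess.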

Effect of plot size on variance.—In estimating sample size, the 
effect of plot size and the scale of the unit values on variance must 
be kept in mind. In the plantation survey a plot size of one-quarter
acre was selected and the variance among plot volumes was estimated
to be $s^2 = 4$. This is the variance among volumes per
quarter-acre. Because the desired precision was expressed on a
per-acre basis it was necessary to modify either the precision
specification or $s^2$ to get them on the same scale. In the example,
$s^2$ was used without change and the desired precision was divided
by 4 to put it on a quarter-acre basis. The same result could have
been obtained by leaving the specified precision unchanged and
putting the variance on a per-acre basis. Since the quarter-acre
volumes would be multiplied by 4 to put them on a per-acre basis,
the variance of quarter-acre volumes should be multiplied by 16.
(Remember: If x is a variable with variance $s^2$, then the variance
of a variable z = kx is $k^2 s^2$.)

Plot size has an additional effect on variance. At the same scale 
of measurement, small plots will almost always be more variable 
than large ones. The variance in volume per acre on quarter-acre 
plots would be somewhat larger than the variance in volume per
acre on half-acre plots, but slightly smaller than the variance in
volume per acre on fifth-acre plots. Unfortunately, the relation of
plot size to variance changes from one population to another.
Large plots tend to have a smaller variance because they average
out the effect of clumping and holes. In very uniform populations,
changes in plot size have little effect on variance. In nonuniform
populations the relationship of plot size to variance will depend
on how the sizes of clumps and holes compare to the plot sizes.
Experience is the best guide as to the effect of changing plot size
on variance. Where neither experience nor advice is available, a
very rough approximation can be obtained by the rule:


If plots of size $P_1$ have a variance $s_1^2$ then, on the same scale
of measurement, plots of size $P_2$ will have a variance roughly
equal to

$$s_2^2 = s_1^2 \sqrt{P_1 / P_2}$$

Thus, if the variance in cordwood volume per acre on quarter-acre
plots is $s_1^2 = 61$, the variance in cordwood volume per acre on
tenth-acre plots will be roughly

$$s_2^2 = 61 \sqrt{0.25 / 0.10} = 96$$

The same results will be obtained without worry about the scale
of measurement if the squared coefficients of variation ($C^2$) are
used in place of the variances. The formula would then be

$$C_2^2 = C_1^2 \sqrt{P_1 / P_2}$$
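
The two adjustments described above (change of scale and change of plot size) are easy to mix up, so a small sketch may help. The function names here are hypothetical, and the plot-size rule is only the very rough approximation given in the text.

```python
import math


def rescale_variance(variance, factor):
    """Variance of z = k*x is k^2 times the variance of x."""
    return factor ** 2 * variance


def variance_for_plot_size(variance_1, plot_size_1, plot_size_2):
    """Rough rule from the text: s2^2 = s1^2 * sqrt(P1 / P2), same scale of measurement."""
    return variance_1 * math.sqrt(plot_size_1 / plot_size_2)


# Quarter-acre volumes (variance 4) put on a per-acre basis: multiply by 4^2.
per_acre_variance = rescale_variance(4, 4)

# Per-acre variance of 61 on quarter-acre plots, converted to tenth-acre plots.
tenth_acre_variance = variance_for_plot_size(61, 0.25, 0.10)

print(per_acre_variance, round(tenth_acre_variance))
```

The same second function applies unchanged to squared coefficients of variation, since they follow the same rule.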







Practice problem.—A survey is to be made to estimate the mean 
board-foot volume per acre in a 200-acre tract. Barring a 1-in-20
chance, we would like the estimate to be within 500 board feet of
the population mean. Sample plots will be one-fifth acre. A survey
in a similar tract showed the standard deviation among quarter-acre
plot volumes to be 520 board feet. What size sample will be
needed?

Problem Solution:
The variance among quarter-acre plot volumes is $520^2 = 270,400$.
For quarter-acre volumes expressed on a per-acre basis the
variance would be

$$s_1^2 = (4^2)(270,400) = 4,326,400$$

The estimated variance among fifth-acre plot volumes expressed
on a per-acre basis would then be

$$s_2^2 = s_1^2 \sqrt{P_1 / P_2} = 4,326,400 \sqrt{0.25 / 0.20} = (4,326,400)(1.118) = 4,836,915$$

The population size is N = 1,000 fifth-acre plots.

If as a first guess n = 61, the t value at the .05 level with 60
degrees of freedom is 2.00. The first computed approximation of
n is

$$n = \frac{1}{\dfrac{250,000}{(4)(4,836,915)} + \dfrac{1}{1,000}} = 71.8$$

The correct solution is between 61 and 71.8 but much closer to the
computed value. Repeated trials will give values between 71.0 and
71.8. The sample size (n) must be an integral value and, because
71 is too small, a sample of n = 72 observations would be required
for the desired precision.


Stratified Random Sampling 


Often we have knowledge of a population which can be used to 
increase the precision or usefulness of our sample. Stratified ran- 
dom sampling is a method that takes advantage of certain types 
of information about the population. 

In stratified random sampling, the units of the population are 
grouped together on the basis of similarity of some characteristic. 
Each group or stratum is then sampled and the group estimates 
are combined to give a population estimate. 

In sampling a forest, we might set up strata corresponding to
the major timber types, make separate sample estimates for each
type, and then combine the type data to give an estimate for the
entire population. If the variation among units within types is
less than the variation among units that are not in the same type,
the population estimate will be more precise than if sampling had
been at random over the entire population.

The sampling and computational procedures can be illustrated 
with data from a cruise made to estimate the mean cubic-foot 
volume per acre on an 800-acre forest. On aerial photographs the 
tract was divided into three strata corresponding to three major 
forest types: pine, bottom-land hardwoods, and upland hardwoods.
The boundaries and total acreage of each type were known. Ten 
one-acre plots were selected at random and without replacement 
in each stratum. 


Stratum Observations 


I. Pine 570 510 600
640 590 780 Total = 6,100 
480 670 700 


II. Bottom-land hardwoods 520 630 810
110 160 580 Total = 7,370
то 890 860 


III. Upland hardwoods 420 540 820
210 180 270 Total = 3,040
290 260 200 


Estimates.—The first step in estimating the population mean 
per unit is to compute the sample mean ($\bar{y}_h$) for each stratum.
The procedure is the same as for the mean of a simple random
sample.

$\bar{y}_{I}$ = 6,100/10 = 610 cubic feet per acre for the pine type
$\bar{y}_{II}$ = 7,370/10 = 737 cubic feet per acre for bottom-land hardwoods
$\bar{y}_{III}$ = 3,040/10 = 304 cubic feet per acre for upland hardwoods

The mean of a stratified sample ($\bar{y}_{st}$) is then computed by

$$\bar{y}_{st} = \frac{\sum_{h=1}^{L} N_h \bar{y}_h}{N}$$

Where: L = The number of strata.
       $N_h$ = The total size (number of units) of stratum h
             (h = 1, ..., L).
       N = The total number of units in all strata
             ($N = \sum_{h=1}^{L} N_h$).





If the strata sizes are

    I. Pine                   = 320 acres = $N_{I}$
   II. Bottom-land hardwoods  = 140 acres = $N_{II}$
  III. Upland hardwoods       = 340 acres = $N_{III}$
                       Total  = 800 acres = N

Then the estimate of the population mean is

$$\bar{y}_{st} = \frac{(320)(610) + (140)(737) + (340)(304)}{800} = 502.175 \text{ cubic feet per acre}$$

For the estimate of the population total ($\hat{Y}_{st}$), simply omit the
divisor N.

$$\hat{Y}_{st} = \sum_{h=1}^{L} N_h \bar{y}_h = 320(610) + 140(737) + 340(304) = 401,740$$

Alternatively,

$$\hat{Y}_{st} = N \bar{y}_{st} = 800(502.175) = 401,740$$
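
As a check on the arithmetic, the stratified mean and total can be computed in a short Python sketch. The dictionary layout and variable names are illustrative choices only; the stratum sizes and means are those of the timber cruise example.

```python
# Stratum sizes N_h (acres) and sample means (cubic feet per acre) from the example.
strata = {
    "pine": {"N": 320, "mean": 610.0},
    "bottom-land hardwoods": {"N": 140, "mean": 737.0},
    "upland hardwoods": {"N": 340, "mean": 304.0},
}

# Total number of units in all strata.
N = sum(s["N"] for s in strata.values())

# Estimated population total: sum of N_h times the stratum mean.
total_st = sum(s["N"] * s["mean"] for s in strata.values())

# Stratified mean: the total divided by N.
mean_st = total_st / N

print(N, total_st, mean_st)
```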
Standard errors.—To determine standard errors, it is first 
necessary to obtain the estimated variance among individuals 
within each stratum ($s_h^2$). These variances are computed in the
same manner as the variance of a simple random sample. Thus,
the variance within Stratum I (Pine) is

$$s_{I}^2 = \frac{(570^2 + 640^2 + \cdots + 700^2) - \dfrac{(6,100)^2}{10}}{(10 - 1)} = \frac{3,794,000 - 3,721,000}{9} = 8,111.1111$$

Similarly,

$$s_{II}^2 = 15,556.6667$$

$$s_{III}^2 = 12,204.4444$$



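From these values we find the standard error of the mean of a
stratified random sample ($s_{\bar{y}_{st}}$) by the formula

$$s_{\bar{y}_{st}} = \sqrt{\frac{1}{N^2} \sum_{h=1}^{L} \left[ \frac{N_h^2 s_h^2}{n_h} \left( 1 - \frac{n_h}{N_h} \right) \right]}$$

Where: $n_h$ = Number of units observed in stratum h.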











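This looks rather ferocious and does get to be a fair amount of
work, but it is not too bad if taken step by step. For the timber
cruising example we would have

$$s_{\bar{y}_{st}} = \sqrt{\frac{1}{800^2} \left[ \frac{(320^2)(8,111.1111)}{10} \left( 1 - \frac{10}{320} \right) + \cdots + \frac{(340^2)(12,204.4444)}{10} \left( 1 - \frac{10}{340} \right) \right]}$$

$$= \sqrt{383.920650}$$

$$= 19.594$$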


As a rough rule we can say that unless a 1-in-20 chance has
occurred, the population mean is included in the range

$$\bar{y}_{st} \pm 2(s_{\bar{y}_{st}}) = 502.175 \pm 2(19.594) = 463 \text{ to } 541$$


If sampling is with replacement or if the sampling fraction
within a particular stratum ($n_h / N_h$) is small, we can omit the
finite-population correction ($1 - n_h / N_h$) for that particular stratum
when calculating the standard error.

The population total being estimated by $\hat{Y}_{st} = N \bar{y}_{st}$, the standard
error of $\hat{Y}_{st}$ is simply

$$s_{\hat{Y}_{st}} = N s_{\bar{y}_{st}} = 800(19.594) = 15,675$$
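
The step-by-step computation of the standard error can be sketched in Python as follows. The tuple layout and variable names are illustrative assumptions; the stratum sizes, sample sizes, and within-stratum variances are the ones quoted in the example, and the stratified mean of 502.175 is taken from the earlier computation.

```python
import math

# (N_h, n_h, s_h^2) for pine, bottom-land hardwoods, and upland hardwoods.
strata = [
    (320, 10, 8111.1111),
    (140, 10, 15556.6667),
    (340, 10, 12204.4444),
]

N = sum(N_h for N_h, _, _ in strata)

# Variance of the stratified mean, with the finite-population correction per stratum.
variance_of_mean = sum(
    (N_h ** 2 * s2_h / n_h) * (1 - n_h / N_h) for N_h, n_h, s2_h in strata
) / N ** 2

se_mean = math.sqrt(variance_of_mean)
se_total = N * se_mean

# Rough 1-in-20 limits around the stratified mean from the text.
mean_st = 502.175
lower, upper = mean_st - 2 * se_mean, mean_st + 2 * se_mean

print(round(se_mean, 3), round(se_total), round(lower), round(upper))
```

Dropping the factor `(1 - n_h / N_h)` for a stratum corresponds to ignoring the finite-population correction for that stratum.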


Discussion.—Stratified random sampling offers two primary 
advantages over simple random sampling. First, it provides sepa- 
rate estimates of the mean and variance of each stratum. Second, 
for a given sampling intensity, it often gives more precise esti- 
mates of the population parameters than would a simple random 
sample of the same size. For this latter advantage, however, it is 
necessary that the strata be set up so that the variability among 
unit values within the strata is less than the variability among 
units that are not in the same stratum. 

Some drawbacks are that each unit in the population must be 
assigned to one and only one stratum, that the size of each stratum 
must be known, and that a sample must be taken in each stratum. 
The most common barrier to the use of stratified random sampling 
is lack of knowledge of the strata sizes. If the sampling fractions 
are small in each stratum, it is not necessary to know the exact 
strata sizes; the population mean and its standard error can be 
computed from the relative sizes. If $r_h$ = the relative size of
stratum h, the estimated mean is

$$\bar{y}_{st} = \frac{\sum_{h=1}^{L} r_h \bar{y}_h}{\sum_{h=1}^{L} r_h}$$











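The estimated standard error of the mean is

$$s_{\bar{y}_{st}} = \sqrt{\frac{1}{\left( \sum_{h=1}^{L} r_h \right)^2} \sum_{h=1}^{L} \frac{r_h^2 s_h^2}{n_h}}$$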





It is worth repeating that the sizes or relative sizes of the strata 
must be known in advance of sampling; the error formulae given 
above are not applicable if the observations from which the strata 
means are estimated are also used to estimate the strata sizes. 


Sample Allocation in Stratified Random Sampling 


Assuming we have decided on a total sample size of n observations,
how do we know how many of these observations to make
in each stratum? Two common solutions to this problem are known
as proportional and optimum allocation.

Proportional allocation.—In this procedure the proportion of 
the sample that is selected in the hth stratum is made equal to the
proportion of all units in the population which fall in that stratum.
If a stratum contains half of the units in the population, half of
the sample observations would be made in that stratum. In equation
form, if the total number of sample units is to be n, then for
proportional allocation the number to be observed in stratum h is

$$n_h = \left( \frac{N_h}{N} \right) n$$


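In the previous example, the 30 sample observations were divided
equally among the strata. For proportional allocation we would
have used

$$n_{I} = \left( \frac{N_{I}}{N} \right) n = \left( \frac{320}{800} \right) 30 = 12$$

$$n_{II} = \left( \frac{140}{800} \right) 30 = 5.25 \text{ or } 5$$

$$n_{III} = \left( \frac{340}{800} \right) 30 = 12.75 \text{ or } 13$$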


Optimum allocation.—In optimum allocation the observations 
are allocated to the strata so as to give the smallest standard error 
possible with a total of n observations. For a sample of size n, the 
number of observations ($n_h$) to be made in stratum h under
optimum allocation is

$$n_h = \left( \frac{N_h s_h}{\sum_{h=1}^{L} N_h s_h} \right) n$$
















In terms of the previous example the value of $N_h s_h$ for each
stratum is

$$N_{I} s_{I} = 320 \sqrt{8,111.1111} = 320(90.06) = 28,819.20$$

$$N_{II} s_{II} = 140 \sqrt{15,556.6667} = 140(124.73) = 17,462.20$$

$$N_{III} s_{III} = 340 \sqrt{12,204.4444} = 340(110.47) = 37,559.80$$

$$\text{Total} = 83,841.20 = \sum_{h=1}^{L} N_h s_h$$


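Applying these values in the formula, we would get

$$n_{I} = \left( \frac{28,819.20}{83,841.20} \right) 30 = 10.3 \text{ or } 10$$

$$n_{II} = \left( \frac{17,462.20}{83,841.20} \right) 30 = 6.2 \text{ or } 6$$

$$n_{III} = \left( \frac{37,559.80}{83,841.20} \right) 30 = 13.4 \text{ or } 14$$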

Here optimum allocation is not much different from proportional 
allocation. Sometimes the difference is great. 
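
The two allocation rules can be compared side by side with a short Python sketch. The dictionary names are illustrative only; the stratum sizes and variances are those of the timber cruise example, and the results are left unrounded so that the final rounding to whole plots remains a judgment call, as in the text.

```python
import math

# Stratum sizes (acres) and within-stratum variances quoted in the example.
sizes = {"pine": 320, "bottom-land hardwoods": 140, "upland hardwoods": 340}
variances = {
    "pine": 8111.1111,
    "bottom-land hardwoods": 15556.6667,
    "upland hardwoods": 12204.4444,
}
n = 30  # total number of sample observations

N = sum(sizes.values())

# Proportional allocation: n_h = (N_h / N) * n
proportional = {name: n * N_h / N for name, N_h in sizes.items()}

# Optimum allocation: n_h = (N_h * s_h / sum of N_h * s_h) * n
weights = {name: sizes[name] * math.sqrt(variances[name]) for name in sizes}
total_weight = sum(weights.values())
optimum = {name: n * w / total_weight for name, w in weights.items()}

for name in sizes:
    print(f"{name:22s} proportional {proportional[name]:6.2f}   optimum {optimum[name]:6.2f}")
```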


Optimum Allocation With Varying Sampling Costs 


Optimum allocation as just described assumes that the sampling 
cost per unit is the same in all strata. When sampling costs vary 
from one stratum to another, the allocation giving the most
information per dollar is

$$n_h = \left( \frac{N_h s_h / \sqrt{c_h}}{\sum_{h=1}^{L} \left( N_h s_h / \sqrt{c_h} \right)} \right) n$$

Where: $c_h$ = Cost per sampling unit in stratum h.

The best way to allocate a sample among the various strata de- 
pends on the primary objectives of the survey and our information 
about the population. One of the two forms of optimum allocation 
is preferable if the objective is to get the most precise estimate 
of the population mean for a given cost. If we want separate esti- 
mates for each stratum and the overall estimate is of secondary 
importance, we may want to sample heavily in the strata having 
high-value material. Then we would ignore both optimum and pro- 
portional allocation and place our observations so as to give the 
degree of precision desired for the particular strata. 

We cannot, of course, use optimum allocation without having
some idea about the variability within the various strata. The
appropriate measure of variability within the stratum is the
standard deviation (not the standard error), but we need not
know the exact standard deviation ($s_h$) for each stratum. In place
of actual $s_h$ values, we can use relative values. In our example, if
we had known that the standard deviations for the strata were
about in the proportions $s_{I} : s_{II} : s_{III}$ = 9:12:11, we could have used
these values and obtained about the same allocation. Where optimum
allocation is indicated but nothing is known about the
strata standard deviations, proportional allocation is often very
satisfactory.

Caution! In some situations the optimum allocation formula
will indicate that the number of units ($n_h$) to be selected in a
stratum is larger than the stratum ($N_h$) itself. The common procedure
then is to sample all units in the stratum and to recompute
the total sample size (n) needed to obtain the desired precision.
The method of estimating n is discussed in the next section.


Sample Size in Stratified Random Sampling 
In order to estimate the total size of sample (n) needed in a
stratified random sample, the following pieces of information are
required:

   A statement of the desired size of the standard error of the
   mean. This will be symbolized by D.

   A reasonably good estimate of the variance ($s_h^2$) or standard
   deviation ($s_h$) among individuals within each stratum.

   The method of sample allocation. If the choice is optimum
   allocation with varying sampling costs, the sampling cost
   per unit for each stratum must also be known.

Given this hard-to-come-by information, we can estimate the size
of sample (n) with these formulae:


For equal samples in each of the L strata,

$$n = \frac{L \sum_{h=1}^{L} N_h^2 s_h^2}{N^2 D^2 + \sum_{h=1}^{L} N_h s_h^2}$$

For proportional allocation,

$$n = \frac{N \sum_{h=1}^{L} N_h s_h^2}{N^2 D^2 + \sum_{h=1}^{L} N_h s_h^2}$$


For optimum allocation with equal sampling costs among strata,

$$n = \frac{\left( \sum_{h=1}^{L} N_h s_h \right)^2}{N^2 D^2 + \sum_{h=1}^{L} N_h s_h^2}$$



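For optimum allocation with varying sampling costs among strata,

$$n = \frac{\left( \sum_{h=1}^{L} N_h s_h \sqrt{c_h} \right) \left( \sum_{h=1}^{L} \dfrac{N_h s_h}{\sqrt{c_h}} \right)}{N^2 D^2 + \sum_{h=1}^{L} N_h s_h^2}$$
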


When the sampling fractions ($n_h / N_h$) are likely to be very small for
all strata or when sampling will be with replacement, the second
term of the denominators of the above formulae ($\sum N_h s_h^2$) may
be omitted, leaving only $N^2 D^2$.

If the optimum allocation formula indicates a sample ($n_h$)
greater than the total number of units ($N_h$) in a particular stratum,
$n_h$ is usually made equal to $N_h$; i.e., all units in that particular
stratum are observed. The previously estimated sample size
(n) should then be dropped and the total sample size (n') and
allocation for the remaining strata recomputed omitting the $N_h$
and $s_h$ values for the offending stratum but leaving N and D
unchanged.

As an illustration, assume a population of 4 strata with sizes
($N_h$) and estimated variances ($s_h^2$) as follows:

Stratum      N_h      s_h^2      s_h     N_h s_h     N_h s_h^2
   1         200        400       20       4,000        80,000
   2         100        900       30       3,000        90,000
   3         400        400       20       8,000       160,000
   4          20     19,600      140       2,800       392,000

         N = 720                          17,800       722,000





With optimum allocation (same sampling cost per unit in all
strata), the number of observations to estimate the population
mean with a standard error of D = 1 is

$$n = \frac{(17,800)^2}{(720^2)(1^2) + 722,000} = 255.4 \text{ or } 256$$


The allocation of these observations according to the optimum
formula would be

$$n_1 = \left( \frac{4,000}{17,800} \right) 256 = 57.5 \text{ or } 58$$

$$n_2 = \left( \frac{3,000}{17,800} \right) 256 = 43.1 \text{ or } 43$$

$$n_3 = \left( \frac{8,000}{17,800} \right) 256 = 115.1 \text{ or } 115$$

$$n_4 = \left( \frac{2,800}{17,800} \right) 256 = 40.3 \text{ or } 40$$

The number of units allocated to the fourth stratum is greater
than the total size of the stratum. Thus every unit in this stratum
would be selected ($n_4 = N_4 = 20$) and the sample size for the
first three strata recomputed. For these three strata,

$$\sum N_h s_h = 15,000$$

$$\sum N_h s_h^2 = 330,000$$

Hence,

$$n' = \frac{(15,000)^2}{(720^2)(1^2) + 330,000} = 265$$

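And the allocation of these observations among the three strata
would be

$$n_1 = \left( \frac{4,000}{15,000} \right) 265 = 70.7 \text{ or } 71$$

$$n_2 = \left( \frac{3,000}{15,000} \right) 265 = 53.0 \text{ or } 53$$

$$n_3 = \left( \frac{8,000}{15,000} \right) 265 = 141.3 \text{ or } 141$$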
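
The sample-size computation for optimum allocation, including the step of sampling an over-allocated stratum completely and recomputing, might be sketched as follows. The function names are hypothetical, equal sampling costs are assumed, and the handling of a single offending stratum simply mirrors the four-stratum illustration rather than covering every possible case.

```python
import math


def optimum_total(strata, N, D):
    """n = (sum N_h s_h)^2 / (N^2 D^2 + sum N_h s_h^2), equal sampling costs."""
    sum_ns = sum(N_h * s_h for N_h, s_h in strata)
    sum_ns2 = sum(N_h * s_h ** 2 for N_h, s_h in strata)
    return sum_ns ** 2 / (N ** 2 * D ** 2 + sum_ns2)


def optimum_allocation(strata, n):
    """n_h = n * N_h s_h / sum of N_h s_h."""
    sum_ns = sum(N_h * s_h for N_h, s_h in strata)
    return [n * N_h * s_h / sum_ns for N_h, s_h in strata]


# (N_h, s_h) for the four strata of the illustration.
strata = [(200, 20), (100, 30), (400, 20), (20, 140)]
N = sum(N_h for N_h, _ in strata)
D = 1  # desired standard error of the mean

# First pass: total sample size (rounded up, as in the text) and its allocation.
n_first = math.ceil(optimum_total(strata, N, D))
first_allocation = optimum_allocation(strata, n_first)

# Strata whose allocation exceeds the stratum size are sampled completely.
oversampled = [
    i for i, ((N_h, _), n_h) in enumerate(zip(strata, first_allocation)) if n_h > N_h
]
remaining = [s for i, s in enumerate(strata) if i not in oversampled]

# Second pass: recompute for the remaining strata, leaving N and D unchanged.
# The text rounds 265.2 to 265 here, so round() is used instead of ceil().
n_rest = round(optimum_total(remaining, N, D))
rest_allocation = optimum_allocation(remaining, n_rest)

print(n_first, [round(v, 1) for v in first_allocation])
print(n_rest, [round(v, 1) for v in rest_allocation])
```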


Regression Estimators 


Regression estimators, like stratification, were developed to increase
the precision or efficiency of a sample by making use of
supplementary information about the population being studied.
If we have exact knowledge of the basal area of a stand of timber, 
the relationship between volume and basal area may help us to 
improve our estimate of stand volume. The sample data provides 
information on the volume-basal area relationship which is then 
applied to the known basal area, giving a volume estimate that 
may be better or cheaper than would be obtained by sampling 
volume alone. 

Suppose a 100 percent inventory of a 200-acre pine stand indicates
a basal area of 84 square feet per acre in trees 3.6 inches
in d.b.h. and larger. Assume further that on 20 random plots, each
one-fifth acre in size, measurements were made of the basal area
(x) and volume (y) per acre.



 Basal area       Volume          Basal area       Volume
per acre (x)    per acre (y)     per acre (x)    per acre (y)
 (sq. ft.)       (cu. ft.)        (sq. ft.)       (cu. ft.)
     88            1,680              82            1,560
     72            1,460              76            1,560
     80            1,590              86            1,610
     96            1,880              73            1,370
     64            1,240              79            1,490
     48            1,060              85            1,710
     76            1,500              84            1,600
     85            1,620              75            1,440
     93            1,880
    110            2,140    Total  1,620           31,860
     88            1,840
     80            1,630    Mean      81            1,593


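Some values that will be needed later are

n = 20                        $\sum xy$ = 2,635,500
$\sum y$ = 31,860                 $\sum x$ = 1,620
$\bar{y}$ = 1,593                   $\bar{x}$ = 81
$\sum y^2$ = 51,822,600           $\sum x^2$ = 134,210

$$SS_y = \sum y^2 - \frac{(\sum y)^2}{n} = 51,822,600 - \frac{(31,860)^2}{20} = 1,069,620$$

$$s_y^2 = \frac{SS_y}{n - 1} = \frac{1,069,620}{19} = 56,295.79$$

$$SS_x = \sum x^2 - \frac{(\sum x)^2}{n} = 134,210 - \frac{(1,620)^2}{20} = 2,990$$

$$SP_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 2,635,500 - \frac{(1,620)(31,860)}{20} = 54,840$$

N = total number of fifth-acre plots in the population (= 1,000)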


The relationship between y and x may take one of several forms,
but here we will assume that it is a straight line. The equation
for the line can be estimated from

$$\bar{y}_R = \bar{y} + b(X - \bar{x})$$

Where: $\bar{y}_R$ = The mean value of y as estimated from X
            (a specified value of the variable X).
       $\bar{y}$ = The sample mean of y (= 1,593).
       $\bar{x}$ = The sample mean of x (= 81).
       b = The linear regression coefficient of y on x.

For the linear regression estimator used here, the value of the
regression coefficient is estimated by

$$b = \frac{SP_{xy}}{SS_x} = \frac{54,840}{2,990} = 18.34$$







Thus, the equation would be

$$\bar{y}_R = 1,593 + 18.34(X - 81) = 107.46 + 18.34X$$

To estimate the mean volume per acre for the tract we substitute
for X the known mean basal area per acre.

$$\bar{y}_R = 107.46 + 18.34(84) = 1,648 \text{ cubic feet per acre}$$


Standard error.—In computing standard errors for simple random
sampling and stratified random sampling, it was first necessary
to obtain an estimate ($s_y^2$) of the variability of individual
values of y about their mean. To obtain the standard error for a
regression estimator, we need an estimate of the variability of
the individual y-values about the regression of y on x. A measure
of this variability is the standard deviation from regression
($s_{y \cdot x}$) which is computed by

$$s_{y \cdot x} = \sqrt{\frac{SS_y - \dfrac{(SP_{xy})^2}{SS_x}}{n - 2}} = \sqrt{\frac{1,069,620 - \dfrac{(54,840)^2}{2,990}}{20 - 2}} = 59.53$$

The symbol $s_{y \cdot x}$ bears a strong resemblance to the covariance symbol
($s_{yx}$) with which it must not be confused.

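Having the standard deviation from regression, the standard
error of $\bar{y}_R$ is

$$s_{\bar{y}_R} = s_{y \cdot x} \sqrt{\left( \frac{1}{n} + \frac{(X - \bar{x})^2}{SS_x} \right) \left( 1 - \frac{n}{N} \right)}$$

$$= 59.53 \sqrt{\left( \frac{1}{20} + \frac{(84 - 81)^2}{2,990} \right) \left( 1 - \frac{20}{1,000} \right)}$$

$$= 13.57$$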





With such a small sampling fraction (n/N = 0.02), the finite-population
correction (1 − n/N) could have been ignored, and the
standard error would be 13.71.
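
The whole chain of computations for the linear regression estimator can be traced from the sample sums with a short Python sketch. The variable names are illustrative assumptions; the sums, the known mean basal area (84), and N = 1,000 are the values given in the example. The regression coefficient is carried unrounded here, which does not change the rounded results.

```python
import math

# Sample sums from the 20 fifth-acre plots, as listed in the text.
n, N = 20, 1000
sum_x, sum_y = 1620, 31860
sum_x2, sum_y2, sum_xy = 134210, 51822600, 2635500
X = 84  # known population mean basal area per acre

x_bar, y_bar = sum_x / n, sum_y / n

# Corrected sums of squares and products.
ss_x = sum_x2 - sum_x ** 2 / n
ss_y = sum_y2 - sum_y ** 2 / n
sp_xy = sum_xy - sum_x * sum_y / n

# Regression coefficient and regression estimate of mean volume per acre.
b = sp_xy / ss_x
y_reg = y_bar + b * (X - x_bar)

# Standard deviation from regression and standard error of the estimate.
s_yx = math.sqrt((ss_y - sp_xy ** 2 / ss_x) / (n - 2))
se_reg = s_yx * math.sqrt((1 / n + (X - x_bar) ** 2 / ss_x) * (1 - n / N))
se_reg_no_fpc = s_yx * math.sqrt(1 / n + (X - x_bar) ** 2 / ss_x)

print(round(b, 2), round(y_reg), round(s_yx, 2), round(se_reg, 2), round(se_reg_no_fpc, 2))
```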

It is interesting to compare $s_{\bar{y}_R}$ with the standard error that
would have been obtained by estimating the mean volume per acre
from the y-values only. The estimated mean volume per acre would
have been $\bar{y}$ = 1,593 (compared to 1,648 using the regression
estimator). The standard error of this estimate would be

$$s_{\bar{y}} = \sqrt{\frac{s_y^2}{n} \left( 1 - \frac{n}{N} \right)} = \sqrt{\frac{56,295.79}{20} (0.98)} = 52.52$$

(compared to a standard error of 13.57 with the regression estimator).


The family of regression estimators.—The regression procedure 
in the above example is valid only if certain conditions are met. 
One of these is, of course, that we know the population mean for 
the supplementary variable (x). As will be shown in the next
section (Double Sampling), an estimate of the population mean
can often be substituted.

Another condition is that the relationship of y to x must be
reasonably close to a straight line within the range of x values
for which y will be estimated. If the relationship departs very
greatly from a straight line, our estimate of the mean value of
y will not be reliable. Often a curvilinear function is more
appropriate.

A third condition is that the variance of y about its mean should
be the same at all levels of x. This condition is difficult to evaluate
with the amount of data usually available. Ordinarily the question
is answered from our knowledge of the population or by making
special studies of the variability of y. If we know the way in
which the variance changes with changes in the level of x, a
weighted regression procedure may be used.

Thus, the linear regression estimator that has been described 
is just one of a large number of related procedures that enable us 
to increase our sampling efficiency by making use of supplemen- 
tary information about the population. Two other members of this 
family are the ratio-of-means estimator and the mean-of-ratios 
estimator. 

The ratio-of-means estimator is appropriate when the relationship
of y to x is in the form of a straight line passing through
the origin and when the standard deviation of y at any given level
of x is proportional to the square root of x. The ratio estimate
($\bar{y}_R$) of mean y is

$$\bar{y}_R = RX$$

Where: R = The ratio of means obtained from the sample
         = $\bar{y} / \bar{x}$ or $\sum y / \sum x$
       X = The known population mean of x.




The standard error of this estimate can be reasonably approximated
for large samples by

$$s_{\bar{y}_R} = \sqrt{\left( \frac{s_y^2 + R^2 s_x^2 - 2 R s_{xy}}{n} \right) \left( 1 - \frac{n}{N} \right)}$$

Where: $s_y^2$ = The estimated variance of y.
       $s_x^2$ = The estimated variance of x.
       $s_{xy}$ = The estimated covariance of x and y.

It is difficult to say when a sample is large enough for the standard
error formula to be reliable, but Cochran (see References,
p. 78) has suggested that n must be greater than 30 and also
large enough so that the ratios $s_{\bar{y}} / \bar{y}$ and $s_{\bar{x}} / \bar{x}$ are both less than 0.1.

To illustrate the computations, assume that for a population of
N = 400 units, the population mean of x is known to be 62 and
that from this population a sample of n = 10 units is selected.
The y and x values for these 10 units are found to be

Observation    y_i    x_i      Observation    y_i    x_i
     1           8     62           8          11     96
     2          13     81           9           5     36
     3           5     40          10          12     70
     4           6     46
     5          19    123         Total        96    680
     6           9     74
     7           8     52         Mean        9.6     68

From this sample the ratio-of-means is

$$R = \frac{9.6}{68} = 0.141$$

The ratio-of-means estimator is then

$$\bar{y}_R = RX = 0.141(62) = 8.742$$

To compute the standard error of the mean we will need the variances
of y and x and also the covariance. These values are computed
by the standard formulae for a simple random sample. Thus,

$$s_y^2 = \frac{(8^2 + 13^2 + \cdots + 12^2) - \dfrac{(96)^2}{10}}{(10 - 1)} = 18.7111$$

$$s_x^2 = \frac{(62^2 + 81^2 + \cdots + 70^2) - \dfrac{(680)^2}{10}}{(10 - 1)} = 733.5556$$

$$s_{xy} = \frac{(8)(62) + (13)(81) + \cdots + (12)(70) - \dfrac{(96)(680)}{10}}{(10 - 1)} = 110.2222$$







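Substituting these values in the formula for the standard error
of the mean gives

$$s_{\bar{y}_R} = \sqrt{\left( \frac{18.7111 + (0.141^2)(733.5556) - 2(0.141)(110.2222)}{10} \right) \left( 1 - \frac{10}{400} \right)}$$

$$= \sqrt{0.215690}$$

$$= 0.464$$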


This computation is, of course, for illustrative purposes only. For
the ratio-of-means estimator, a standard error based on less than
30 observations is usually of questionable value.
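
The ratio-of-means computations can be reproduced from the ten pairs of observations with a short Python sketch. The variable names are illustrative; the data are the y and x values as used in the variance computations above, and R is rounded to three decimals only to match the hand computation in the text.

```python
import math

# Sample observations (second y value is 13, as in the variance computation).
y = [8, 13, 5, 6, 19, 9, 8, 11, 5, 12]
x = [62, 81, 40, 46, 123, 74, 52, 96, 36, 70]
N = 400  # population size
X = 62   # known population mean of x
n = len(y)

# Ratio of means, rounded to three decimals as in the text.
R = round(sum(y) / sum(x), 3)
y_ratio = R * X

# Sample variances and covariance by the usual simple-random-sample formulae.
s2_y = (sum(v * v for v in y) - sum(y) ** 2 / n) / (n - 1)
s2_x = (sum(v * v for v in x) - sum(x) ** 2 / n) / (n - 1)
s_xy = (sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n) / (n - 1)

# Large-sample approximation of the standard error of the ratio estimate.
se_ratio = math.sqrt(((s2_y + R ** 2 * s2_x - 2 * R * s_xy) / n) * (1 - n / N))

print(R, round(y_ratio, 3), round(s2_y, 4), round(s2_x, 4), round(s_xy, 4), round(se_ratio, 3))
```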

The mean-of-ratios estimator is appropriate when the relation
of y to x is in the form of a straight line passing through the
origin and the standard deviation of y at a given level of x is
proportional to x (rather than to $\sqrt{x}$). The ratio ($r_i$) of $y_i$ to $x_i$
is computed for each pair of sample observations. Then the estimated
mean of y for the population is

$$\bar{y}_R = RX$$

Where: R = the mean of the individual ratios ($r_i$), i.e.,

$$R = \frac{\sum_{i=1}^{n} r_i}{n}$$

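To compute the standard error of this estimate we must first
obtain a measure ($s_r^2$) of the variability of the individual ratios
($r_i$) about their mean.

$$s_r^2 = \frac{\sum r_i^2 - \dfrac{(\sum r_i)^2}{n}}{n - 1}$$

The standard error for the mean-of-ratios estimator of mean
y is then

$$s_{\bar{y}_R} = X \sqrt{\frac{s_r^2}{n} \left( 1 - \frac{n}{N} \right)}$$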



Suppose that a set of n = 10 observations is taken from a
population of N = 100 units having a mean x value of 40:

Observation     y_i     x_i     r_i
     1           36      18     2.00
     2           95      48     1.98
     3          108      46     2.35
     4          172      74     2.32
     5          126      58     2.17
     6           58      26     2.23
     7          123      60     2.05
     8           98      51     1.92
     9           54      25     2.16
    10           14       7     2.00

                      Total    21.18

The sample mean-of-ratios is

$$R = \frac{21.18}{10} = 2.118$$

And this is used to obtain the mean-of-ratios estimator

$$\bar{y}_R = RX = 2.118(40) = 84.72$$

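The variance of the individual ratios is

$$s_r^2 = \frac{(2.00^2 + 1.98^2 + \cdots + 2.00^2) - \dfrac{(21.18)^2}{10}}{(10 - 1)} = 0.022484$$

Thus, the standard error of the mean-of-ratios estimator is

$$s_{\bar{y}_R} = 40 \sqrt{\frac{0.022484}{10} \left( 1 - \frac{10}{100} \right)} = 1.799$$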
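
For comparison with the ratio-of-means sketch, the mean-of-ratios computations can be traced in Python as follows. The variable names are illustrative; the y and x values are those of the ten observations in the example, and each ratio is rounded to two decimals to match the hand computation.

```python
import math

# Sample observations from the mean-of-ratios example.
y = [36, 95, 108, 172, 126, 58, 123, 98, 54, 14]
x = [18, 48, 46, 74, 58, 26, 60, 51, 25, 7]
N = 100  # population size
X = 40   # known population mean of x
n = len(y)

# Individual ratios r_i = y_i / x_i, rounded to two decimals as in the text.
ratios = [round(yi / xi, 2) for yi, xi in zip(y, x)]

# Mean of ratios and the resulting estimate of mean y.
mean_ratio = sum(ratios) / n
y_ratio = mean_ratio * X

# Variance of the individual ratios about their mean.
s2_r = (sum(r * r for r in ratios) - sum(ratios) ** 2 / n) / (n - 1)

# Standard error of the mean-of-ratios estimate.
se_ratio = X * math.sqrt((s2_r / n) * (1 - n / N))

print(ratios)
print(round(mean_ratio, 3), round(y_ratio, 2), round(s2_r, 6), round(se_ratio, 3))
```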


Numerous other forms of ratio estimators are possible, but the 
above three are the most common. Less common forms involve 
fitting some curvilinear function for the relationship of y to x, or
fitting multiple regressions when information is available on more 
than one supplementary variable. 

Warning! The forester who is not sure of his knowledge of
regression techniques would do well to seek advice before adopting
regression estimators in his sampling. Determination of the
most appropriate form of estimator can be very tricky. The two
ratio estimators are particularly troublesome. They have a simple,
friendly appearance that beguiles samplers into misapplications.
The most common mistake is to use them when the relationship
of y to x is not actually in the form of a straight line through the
origin (i.e., the ratio of y to x varies instead of being the same
at all levels of x). To illustrate, suppose that we wish to estimate
the total acreage of farm woodlots in a county. As the total area
in farms can probably be obtained from county records, it might
seem logical to take a sample of farms, obtain the sample ratio
of mean forested acreage per farm to mean total acreage per
farm, and multiply this ratio by the total farm acreage to get the
total area in farm woodlots. This is, of course, the ratio-of-means
estimator, and its use assumes that the ratio of y to x is a constant
(i.e., can be graphically represented by a straight line passing
through the origin). It will often be found, however, that the
proportion of a farm that is forested varies with the size of the
farm. Farms on poor land tend to be smaller than farms on fertile
land, and, because the poor land is less suitable for row crops or
pasture, a higher proportion of the small-farm acreage may be
left in forest. The ratio estimate may be seriously biased.

The total number of diseased seedlings in a nursery might be
estimated by getting the mean proportion of infected seedlings
from a number of sample plots and multiplying this proportion
by the known total number of seedlings in the nursery. Here again
we would be assuming that the proportion of infected seedlings
is the same regardless of the number of seedlings per plot. For
many diseases this assumption would not be valid, for the rate
of infection may vary with the seedling density.


Double Sampling 


Double sampling was devised to permit the use of regression
estimators when the population mean or total of the supplementary
variable is unknown. A large sample is taken in order to
obtain a good estimate of the mean or total for the supplementary
variable (x). On a subsample of the units in this large sample,
the y values are also measured to provide an estimate of the
relationship of y to x. The large sample mean or total of x is then
applied to the fitted relationship to obtain an estimate of the
population mean or total of y.

Updating a forest inventory is one application of double sampling.
Suppose that in 1950 a sample of 200 quarter-acre plots in
an 800-acre forest showed a mean volume of 372 cubic feet per
plot (1,488 cubic feet per acre). A subsample of 40 plots, selected
at random from the 200 plots, was marked for remeasurement in
1955. The relationship of the 1955 volume to the 1950 volume as
determined from the subsample was applied to the 1950 volume
to obtain a regression estimate of the 1955 volume.










The subsample was as follows:

  1955      1950          1955      1950
 volume    volume        volume    volume
  (y)       (x)           (y)       (x)
  370       280           550       430
  290       240           550       460
  520       410           520       400
  490       360           420       390
  530       390           490       340
  330       220           500       420
  310       270           610       470
  400       340           460       350
  450       360           430       340
  430       360           510       380
  460       400           450       370
  480       380           380       300
  430       350           430       290
  500       390           460       340
  640       480           490       370
  660       520           560       440
  490       400           580       480
  510       430           540       420
  270       230
  380       270    Total 18,820    14,790
  420       330
  530       390    Mean  470.50    369.75

$\sum y^2$ = 9,157,400
$\sum x^2$ = 5,661,300
$\sum xy$ = 7,186,300


A plotting of the 40 pairs of plot values on coordinate paper
suggested that the variability of y was the same at all levels of x
and that the relationship of y to x was linear. The estimator
selected on the basis of this information was the linear regression
$\bar{y}_R = a + bX$. Values needed to compute the linear-regression
estimate and its standard error were as follows:

Large-sample data (indicated by the subscript 1):
$n_1$ = Number of observations in large sample = 200
$N$ = Number of sample units in population = 3,200
$\bar{x}_1$ = Large-sample mean of x = 372

Small-sample data (indicated by the subscript 2):
$n_2$ = Number of observations in subsample = 40
$\bar{y}_2$ = Small-sample mean of y = 470.50
$\bar{x}_2$ = Small-sample mean of x = 369.75


\[ SS_y = \sum y^2 - \frac{(\sum y)^2}{n_2} = 9,157,400 - \frac{(18,820)^2}{40} = 302,590.0 \]

\[ SS_x = \sum x^2 - \frac{(\sum x)^2}{n_2} = 5,661,300 - \frac{(14,790)^2}{40} = 192,697.5 \]

\[ SP_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n_2} = 7,186,300 - \frac{(14,790)(18,820)}{40} = 227,605.0 \]

\[ s_y^2 = \frac{SS_y}{n_2 - 1} = \frac{302,590.0}{39} = 7,758.72 \]



The regression coefficient (b) and the squared standard deviation
from regression ($s_{y \cdot x}^2$) are

\[ b = \frac{SP_{xy}}{SS_x} = \frac{227,605.0}{192,697.5} = 1.18 \]

\[ s_{y \cdot x}^2 = \frac{SS_y - \frac{(SP_{xy})^2}{SS_x}}{n_2 - 2} = \frac{302,590 - \frac{(227,605)^2}{192,697.5}}{40 - 2} = 888.2617 \]


And the regression equation is 


\[ \bar{y}_{Rd} = \bar{y}_2 + b(X - \bar{x}_2) \]
\[ = 470.50 + 1.18(X - 369.75) \]
\[ = 34.2 + 1.18X \]


Substituting the 1950 mean volume (372 cubic feet) for X gives
the regression estimate of the 1955 volume.


\[ \bar{y}_{Rd} = 34.2 + 1.18(372) = 473.16 \] cubic feet per plot


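Standard error.—The standard error of $\bar{y}_{Rd}$ when the
linear-regression estimator is used in double sampling is

\[ s_{\bar{y}_{Rd}} = \sqrt{ s_{y \cdot x}^2 \left( \frac{1}{n_2} + \frac{(\bar{x}_1 - \bar{x}_2)^2}{SS_x} \right) \left( 1 - \frac{n_2}{n_1} \right) + \frac{s_y^2}{n_1} \left( 1 - \frac{n_1}{N} \right) } \]

\[ = \sqrt{ 888.2617 \left( \frac{1}{40} + \frac{(372.00 - 369.75)^2}{192,697.5} \right) \left( 1 - \frac{40}{200} \right) + \frac{7,758.72}{200} \left( 1 - \frac{200}{3,200} \right) } \]

= 7.36 cubic feet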
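
The arithmetic of this example can be checked with a short script. What follows is a minimal Python sketch and not part of the handbook; the function name `double_sampling_regression` and its argument names are assumptions chosen for illustration. It takes the subsample sums and the large-sample mean quoted above and returns the regression estimate with its standard error, using the unrounded value of b.

```python
import math


def double_sampling_regression(n1, n2, N, x1_bar,
                               sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Linear-regression estimate and standard error for double sampling.

    Hypothetical helper: the names are illustrative, the formulas follow
    the worked example in the text.
    """
    # Corrected sums of squares and products from the subsample.
    ss_y = sum_y2 - sum_y ** 2 / n2
    ss_x = sum_x2 - sum_x ** 2 / n2
    sp_xy = sum_xy - sum_x * sum_y / n2

    s2_y = ss_y / (n2 - 1)                          # variance of y
    b = sp_xy / ss_x                                # regression coefficient
    s2_yx = (ss_y - sp_xy ** 2 / ss_x) / (n2 - 2)   # squared deviation from regression

    x2_bar = sum_x / n2
    y2_bar = sum_y / n2
    estimate = y2_bar + b * (x1_bar - x2_bar)

    variance = (
        s2_yx * (1 / n2 + (x1_bar - x2_bar) ** 2 / ss_x) * (1 - n2 / n1)
        + (s2_y / n1) * (1 - n1 / N)
    )
    return estimate, math.sqrt(variance)


estimate, std_error = double_sampling_regression(
    n1=200, n2=40, N=3200, x1_bar=372.0,
    sum_x=14_790, sum_y=18_820,
    sum_x2=5_661_300, sum_y2=9_157_400, sum_xy=7_186_300,
)
print(f"{estimate:.2f} +/- {std_error:.2f} cubic feet per plot")
```

Apart from rounding, the two figures should agree with the 473.16 and 7.36 cubic feet obtained by hand.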

















Had the 1955 volume been estimated from the 40 plots without
taking advantage of the relationship of y to x, the estimated mean
would have been


\[ \bar{y}_2 = 470.50 \] cubic feet (instead of 473.16)







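The standard error of $\bar{y}_2$ would have been

\[ s_{\bar{y}} = \sqrt{ \frac{s_y^2}{n_2} \left( 1 - \frac{n_2}{N} \right) } = \sqrt{ \frac{7,758.72}{40} \left( 1 - \frac{40}{3,200} \right) } \]

= 13.84 cubic feet (compared to 7.36)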





Double sampling with other regression estimators.—If the
mean-of-ratios estimate is deemed appropriate, the individual
ratios ($r_i = y_i/x_i$) are computed for the $n_2$ observations of the
subsample. The mean-of-ratios estimate is then


\[ \bar{y}_{Rd} = \bar{R}\,\bar{x}_1 = \left( \frac{\sum r_i}{n_2} \right) \bar{x}_1 \]


with standard error 








where: $s_r^2$ = Variance of r for the subsample

\[ s_r^2 = \frac{\sum r_i^2 - \frac{(\sum r_i)^2}{n_2}}{n_2 - 1} \]

The ratio-of-means estimate, when appropriate, is

\[ \bar{y}_{Rd} = \hat{R}\,\bar{x}_1 \]


with standard error 


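\[ s_{\bar{y}_{Rd}} = \sqrt{ \left( 1 - \frac{n_2}{n_1} \right) \left( \frac{s_y^2 + \hat{R}^2 s_x^2 - 2\hat{R} s_{xy}}{n_2} \right) + \frac{s_y^2}{n_1} \left( 1 - \frac{n_1}{N} \right) } \]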
Where: $\hat{R} = \bar{y}_2 / \bar{x}_2$.
$s_y^2$ = Variance of y in the subsample.
$s_x^2$ = Variance of x in the subsample.
$s_{xy}$ = Covariance of y and x in the subsample.




Sampling When Units are Unequal in Size
(Including PPS Sampling) 


Sampling units of unequal size are common in forestry. Planta- 
tions, farms, woodlots, counties, and sawmills are just a few of 
the natural units that vary in size. Designing and analyzing 
surveys involving unequal-sized units can be quite tricky. Two 
examples will be used to illustrate the problem and some of the 
possible solutions. They also illustrate the very important fact 
that no single method is best for all cases and that designing an 
efficient survey requires considerable skill and caution. 


Example No. 1.—As a first example, suppose that we want to 
estimate the mean milling cost per thousand board feet of lumber 
at southern pine sawmills in a given area. Available for planning 
the survey is a list of the 816 sawmills in the area and the daily
capacity of each. The cost information is to be obtained by per- 
sonal interview. 

In sampling, as in most other endeavors, the simplest approach 
that will do the job is the best; complex procedures should be 
used only when they offer definite advantages. On this principle 
we might first consider taking a simple random sample of the 
mills, obtaining the cost per thousand at each, and computing the 
arithmetic average of these values. Most foresters would reject 
this procedure, and rightly so. The design would give the same 
weight to the cost for a mill producing 8,000 feet per day as to 
the cost for one cutting 50,000 feet per day. As a result, one thou- 
sand feet at the small mill would have a larger representation in 
the final average than the same volume at the large mill, and 
because cost per thousand is undoubtedly related to mill capacity, 
the estimate would be biased. 

An alternative that would give more weight to the large mills
would be to take a random sample of the mills, obtain the total
milling cost ($y_i$) and the total production in MBF ($x_i$) at each,
and then use the ratio-of-means estimator:

\[ \text{Mean cost per MBF} = \frac{\text{Total cost at all sampled mills}}{\text{Total production at all sampled mills}} = \frac{\sum y_i}{\sum x_i} \]


This must also be rejected on the grounds of bias. The ratio-of-
means estimator is unbiased only if the ratio of y to x is the same
at all levels of x. In this example, a constant ratio of y to x means
that the milling cost per thousand is the same regardless of mill
size, an unlikely situation.

An unbiased procedure and one that would be appropriate in 
this situation is sampling with probability proportional to size 
(known as pps sampling). The value to be observed on each 
sample unit would be the milling cost per thousand board feet of 
lumber. Selection of the units with probability proportional to size 
is easily accomplished. 

First, a list is made of all of the mills along with their daily 
capacities and the cumulative sum of capacities. 









                  Daily            Cumulative
   Mill No.   capacity (MBF)          sum
      1            10                   10
      2            27                   37
      3             8                   45
      4            12                   57
      .             .                    .
      .             .                    .
     814           ...               12,210
     815            21               12,231
     816            11               12,242


Next, numbers varying in size from 1 up to the cumulative sum 
for the last mill on the list (12,242) are selected from a table of 
random digits. A particular mill is included in the sample when a 
number is drawn which is equal to or less than the cumulative 
sum for that mill and greater than the cumulative sum for the 
preceding mill. Thus, given a random number of 49 we would
select mill number 4; for 37 we would select mill number 2; for
12,238 we would select mill number 816. An important point is
that sampling must be with replacement (i.e., a given mill may
appear in the sample more than once); otherwise, sampling will
not be proportional to size. 
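
The selection rule above is a lookup in the list of cumulative sums. The following Python sketch illustrates it under stated assumptions: the function names `mill_for_number` and `pps_sample` are hypothetical, a pseudo-random generator stands in for the table of random digits, and only the first four mills of the list (capacities 10, 27, 8, and 12 MBF, cumulative sums 10, 37, 45, 57) are used.

```python
import bisect
import itertools
import random


def mill_for_number(cumulative, number):
    """Return the 1-based list position selected by a random number.

    A mill is chosen when the number is less than or equal to its
    cumulative sum and greater than the preceding cumulative sum.
    """
    return bisect.bisect_left(cumulative, number) + 1


def pps_sample(capacities, n, rng=random):
    """Draw n units with replacement, probability proportional to size."""
    cumulative = list(itertools.accumulate(capacities))
    draws = [rng.randint(1, cumulative[-1]) for _ in range(n)]  # 1..total inclusive
    return [mill_for_number(cumulative, d) for d in draws]


first_four = [10, 27, 8, 12]                          # daily capacities, MBF
cumulative = list(itertools.accumulate(first_four))   # 10, 37, 45, 57

print(mill_for_number(cumulative, 49))   # the text selects mill 4 for 49
print(mill_for_number(cumulative, 37))   # the text selects mill 2 for 37
print(pps_sample(first_four, n=5, rng=random.Random(1962)))
```

Because each draw is independent, the same mill may appear more than once, which is the with-replacement requirement stated above.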

After the sample units have been selected and the unit values 
($y_i$ = milling cost per thousand) obtained, the mean cost per
thousand and the standard error of the mean are computed in the 
same manner as for simple random sampling with replacement. 

Given ten observations of milling cost per thousand with a sum of
152 and a sum of squares of 2,408, the estimated mean is

\[ \bar{y} = \frac{\sum y_i}{n} = \frac{152}{10} = 15.2 \] dollars per thousand


The standard error of the mean is

\[ s_{\bar{y}} = \sqrt{ \frac{\sum y_i^2 - \frac{(\sum y_i)^2}{n}}{n(n - 1)} } = \sqrt{ \frac{2,408 - \frac{(152)^2}{10}}{10(9)} } = 1.04 \]




Another alternative is to group mills of similar size into strata 
and use stratified random sampling. If the cost per thousand is 
related to mill size, this procedure may be slightly biased unless 
all mills in a given stratum are of the same size. With only a small 
within-stratum spread in mill size, the bias will usually be trivial. 

A further refinement would be to group mills of similar size and 
use stratified random sampling with pps sampling of units within 
strata. 

Example No. 2.—Now, consider the problem of estimating the
total daily production of chippable waste at these mills. Assume 
again that we have a list of the mills and their daily capacities. 

We might first consider a simple random sample of the mills
with the unit observation being the mean daily production of 
chippable waste at the selected mills. The arithmetic average of 
these observations multiplied by the total number of mills would 
give an estimate of the total daily production of chippable waste 
by all mills. This estimate would be completely unbiased. However,
because the mills vary greatly in daily capacity and because
total waste production is closely related to total lumber produc- 
tion, there will be a large variation in chippable waste from unit 
to unit. This means that the variance among units will be large 
and that many observations may be needed to obtain an estimate 
of the desired precision. The simple random sample, though un- 
biased, would probably be rejected because of its low precision. 

The ratio-of-means estimator is a second alternative. In this design
a simple random sample would be selected and for each mill
included in the sample we would observe the mean daily production
of chippable waste (y) and the mean daily capacity of the
mill in MBF (x). The ratio of means

\[ \hat{R} = \frac{\bar{y}}{\bar{x}} \]

would give an estimate of the mean waste production per MBF,
and this ratio multiplied by the total capacity of all mills would 
estimate the total daily production of chippable waste. It has been 
pointed out that the ratio-of-means estimator is unbiased if the 
ratio of y to x is the same at all levels of x. Studies have shown
that although the ratio of waste to lumber production varies with 
log size, it is not closely related to mill size—hence the bias, if 
any, in the ratio-of-means estimator would be small. Past experi- 
ence suggests that the variance of the estimate will also be small, 
making it preferable to the simple arithmetic average previously 
discussed. Note that this is a case where a slightly biased estimator
of high precision might be more suitable than an unbiased
estimator of low precision. 

Here again, pps sampling would merit consideration. It would 
give unbiased estimates of moderately good precision. Stratified
sampling with units grouped according to size is another possi-
bility as is the combination of stratification with pps sampling 
within strata. Among the acceptable alternatives no blanket rec- 
ommendation is possible. The best choice depends on many factors, 
chief among them being the form and closeness of the relationship 
between chippable waste (y) and mill capacity (x).


Two-Stage Sampling 

In some forest sampling, locating and getting to a sampling unit 
is expensive, while measurement of the unit is relatively cheap. 
It seems logical in these circumstances to make measurements on 
two or three units at or near each location. This is called two-stage 
sampling, the first stage being the selection of locations, and the 
second stage being the selection of units at these locations. The
advantage of two-stage sampling is that it may yield estimates of 
a given precision at a cost lower than that of a completely random 
sample. 

To illustrate the situation and the methods, consider a land- 
owner whose 60,000 acres of timberland are subdivided into 
square blocks of 40 acres with permanent markers at the four 
corners of each block. A sample survey is to be made of the tract
in order to estimate the mean sawtimber volume per acre. Sample 
units will be square quarter-acre plots. These plots will be located 
on the ground by measurements made with reference to one of 
the corners of the 40-acre blocks. 

Travel and surveying time to a block corner are quite high, 
hence it seems logical, once the block corner is located, to find and 
measure several plots in that block. Thus, the sampling scheme 
would consist of making a random selection of n blocks and then
randomly selecting m plots within each of the selected blocks. In 
sampling language, the 40-acre blocks would be called primary 
sampling units (primaries) and the quarter-acre plots secondary 
sampling units (secondaries). 

If $y_{ij}$ designates the volume of the $j$th sampled plot ($j = 1 \ldots m$)
on the $i$th sampled block ($i = 1 \ldots n$), the estimated mean volume per
plot (symbolized in two-stage sampling by $\bar{\bar{y}}$) is

\[ \bar{\bar{y}} = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} y_{ij}}{mn} \]

The estimated standard error of $\bar{\bar{y}}$ is

\[ s_{\bar{\bar{y}}} = \sqrt{ \frac{1}{mn} \left[ s_B^2 \left( 1 - \frac{n}{N} \right) + \frac{n}{N} s_W^2 \left( 1 - \frac{m}{M} \right) \right] } \]

Where: $n$ = Number of primaries sampled.
       $N$ = Total number of primaries in the population.
       $m$ = Number of secondaries sampled in each of the primaries
             selected for sampling.
       $M$ = Total number of secondaries in each primary.
       $s_B^2$ = Sample variance between primaries when sampled by
             m secondaries per primary (computation procedure
             given below).
       $s_W^2$ = Sample variance among secondaries within primaries
             (computation procedure given below).


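The terms $s_B^2$ and $s_W^2$ are computed from the equations

\[ s_B^2 = \frac{ \frac{\sum_i \left( \sum_j y_{ij} \right)^2}{m} - \frac{\left( \sum_i \sum_j y_{ij} \right)^2}{mn} }{n - 1} \]

\[ s_W^2 = \frac{ \sum_i \sum_j y_{ij}^2 - \frac{\sum_i \left( \sum_j y_{ij} \right)^2}{m} }{n(m - 1)} \]

Since $y_{ij}$ is the observed value of a secondary unit, $\sum_j y_{ij}$ is the
total of all secondary units observed in the $i$th primary (or the
primary total), and $\sum_i \sum_j y_{ij}$ is the grand total of all sampled
secondaries. Hence, the above equations, expressed in words, are

\[ s_B^2 = \frac{ \frac{\sum (\text{Primary totals}^2)}{\text{No. of secondaries sampled per primary}} - \frac{(\sum \text{Secondaries})^2}{\text{Total no. of secondaries sampled}} }{n - 1} \]

\[ s_W^2 = \frac{ \sum (\text{Secondaries}^2) - \frac{\sum (\text{Primary totals}^2)}{\text{No. of secondaries sampled per primary}} }{n(m - 1)} \]

Readers familiar with analysis of variance procedures will recog-
nize $s_B^2$ and $s_W^2$ as the mean square between and within primaries
respectively.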

The computations are not so difficult as the notation might suggest.
Suppose we had sampled m = 3 quarter-acre plots (secondaries)
within each of n = 4 blocks (primaries) and obtained the
following data:







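    Block        Plot          Volume        Block total
  (primary)   (secondary)   (cubic feet)    (cubic feet)
      1           1             147
                  2             180
                  3             206              533
      2           1             312
                  2             265
                  3             300              877
      3           1             220
                  2             280
                  3             210              710
      4           1             250
                  2             232
                  3             185              667
                              -----            -----
    Total                     2,787            2,787


\[ \bar{\bar{y}} = \frac{\sum \sum y_{ij}}{mn} = \frac{2,787}{12} = 232.25 \] cubic feet per plot.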


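To get the standard error of $\bar{\bar{y}}$ we first compute $s_B^2$ and $s_W^2$.

\[ s_B^2 = \frac{ \frac{533^2 + \ldots + 667^2}{3} - \frac{(2,787)^2}{12} }{4 - 1} = \frac{667,402.3333 - 647,280.7500}{3} = 6,707.1944 \]

\[ s_W^2 = \frac{ (147^2 + 180^2 + \ldots + 185^2) - \frac{533^2 + \ldots + 667^2}{3} }{4(3 - 1)} = \frac{675,463.0000 - 667,402.3333}{8} = 1,007.5833 \]
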
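Since the total number of 40-acre blocks in the 60,000 acres is
N = 1,500 and the total number of quarter-acre plots in each
40-acre block is M = 160, the estimated standard error of the
mean is

\[ s_{\bar{\bar{y}}} = \sqrt{ \frac{1}{(3)(4)} \left[ 6,707.1944 \left( 1 - \frac{4}{1,500} \right) + \frac{4}{1,500} (1,007.5833) \left( 1 - \frac{3}{160} \right) \right] } \]

\[ = \sqrt{ \frac{1}{12} [6,689.3085 + 2.6365] } \]

= 23.61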
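
The two-stage computations lend themselves to a short routine. The sketch below is illustrative only: the function name `two_stage_estimate` and its return layout are assumptions, and it presumes the same number of secondaries in every sampled primary, as in the example. It is run on the four blocks of three plots tabulated above.

```python
import math


def two_stage_estimate(blocks, N, M):
    """Mean per secondary and its standard error for two-stage sampling.

    blocks: one list of secondary values per sampled primary (equal lengths).
    N: primaries in the population; M: secondaries in each primary.
    """
    n = len(blocks)
    m = len(blocks[0])
    totals = [sum(plots) for plots in blocks]      # primary totals
    grand_total = sum(totals)
    mean = grand_total / (m * n)

    totals_term = sum(t ** 2 for t in totals) / m
    s2_between = (totals_term - grand_total ** 2 / (m * n)) / (n - 1)
    sum_squares = sum(y ** 2 for plots in blocks for y in plots)
    s2_within = (sum_squares - totals_term) / (n * (m - 1))

    variance = (
        s2_between * (1 - n / N) + (n / N) * s2_within * (1 - m / M)
    ) / (m * n)
    return mean, s2_between, s2_within, math.sqrt(variance)


blocks = [
    [147, 180, 206],
    [312, 265, 300],
    [220, 280, 210],
    [250, 232, 185],
]
mean, s2_b, s2_w, std_error = two_stage_estimate(blocks, N=1500, M=160)
print(f"mean = {mean:.2f}, s_B^2 = {s2_b:.4f}, s_W^2 = {s2_w:.4f}, SE = {std_error:.2f}")
```

Multiplying the mean and its standard error by 4 converts them to a per-acre basis, since the plots are one-quarter acre in size.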







The estimated mean per plot is 232.25 cubic feet. The standard
error of this estimate is 23.61 cubic feet. As the plots are one-
quarter acre in size, the estimated mean volume per acre is
$4(\bar{\bar{y}}) = 929$ cubic feet. The standard error of the mean volume
per acre is $4(s_{\bar{\bar{y}}}) = 94.44$.

An estimate of the total volume and its standard error can be
obtained either from the mean per plot or mean per acre volumes
and their standard errors. The mean per plot is 232.25 ± 23.61.
To expand this to the total, each figure must be multiplied by the
number of quarter-acre plots in the entire tract (= 240,000); the
estimated total is

55,740,000 ± 5,666,400.

The mean per acre is 929 ± 94.44. To expand this, each figure
must be multiplied by the total number of acres in the tract
(= 60,000). Thus, the estimated total is

55,740,000 ± 5,666,400 as before.

Small sampling fractions.—If the number of primary units
sampled (n) is a small fraction of the total number of primary
units (N), the standard error formula simplifies to

\[ s_{\bar{\bar{y}}} = \sqrt{ \frac{s_B^2}{mn} } \]

This reduced formula is usually applied where the ratio n/N is
less than 0.01. In the example above, the sampling fraction for
primaries was 4/1,500, so we could very well have used the
short formula. The estimated standard error would have been

\[ s_{\bar{\bar{y}}} = \sqrt{ \frac{6,707.1944}{(3)(4)} } = \sqrt{558.9329} \]

= 23.64 (instead of 23.61 by the longer formula).


When n/N is fairly large but the number of secondaries (m) 
sampled in each selected primary is only a small fraction of the 
total number of secondaries (M) in each primary, the standard 
error formula would be 


\[ s_{\bar{\bar{y}}} = \sqrt{ \frac{1}{mn} \left[ s_B^2 \left( 1 - \frac{n}{N} \right) + \frac{n s_W^2}{N} \right] } \]

Sample size for two-stage sampling.—For a fixed number of 
sample observations, two-stage sampling is usually less precise 
than simple random sampling. The advantage of the method is 
that by reducing the cost per observation it permits us to obtain 
the desired precision at a lower cost.

Usually the precision and cost both increase as the number of 
primaries is increased and the number of secondaries (m) per 
sampled primary is decreased. The cost may be reduced by taking
fewer primaries and more secondaries per primary, but precision
usually suffers. This suggests that there is a number (m) of sec- 
ondaries per primary that will be optimum from the standpoint 
of giving the greatest precision for a given amount of money. The 
value of m that is optimum depends on the nature of the population
variability between primaries and among secondaries within
primaries, and on the relationship between the cost per primary 
and the added cost per secondary. 

The population variability between primaries is symbolized by
$\sigma_B^2$ and the variability within primaries by $\sigma_W^2$. Note that these
are population values, not sample values. Occasionally we will
have some knowledge of $\sigma_B^2$ and $\sigma_W^2$ from previous work with the
population. More often, it will be necessary to take a preliminary
sample to estimate the population variabilities. From this pre-
sample, we compute $s_B^2$ and $s_W^2$ according to the formulae in the
discussion of the error of a two-stage sample. Then our estimates
of the population variability within and between primaries are

\[ \hat{\sigma}_W^2 = s_W^2 \]

\[ \hat{\sigma}_B^2 = \frac{s_B^2 - s_W^2}{m} \]

The cost of locating and establishing a primary unit (not counting
overhead costs) is symbolized by $c_p$. The additional cost of
getting to and measuring a secondary unit after the primary has
been located is symbolized by $c_s$.

Given the necessary cost and variance information, we can
estimate the optimum size of m (say $m_o$) by

\[ m_o = \sqrt{ \left( \frac{c_p}{c_s} \right) \left( \frac{\hat{\sigma}_W^2}{\hat{\sigma}_B^2} \right) } \]

If $m_o$ is greater than the number of secondaries per primary (M),
the formula value is ignored and $m_o$ is set equal to M.

Once $m_o$ has been estimated, the number of primary units (with
$m_o$ secondaries per primary) needed to estimate the mean with a
specified standard error (D) is

\[ n = \frac{ \hat{\sigma}_B^2 + \frac{\hat{\sigma}_W^2}{m_o} }{ D^2 + \frac{1}{N} \left( \hat{\sigma}_B^2 + \frac{\hat{\sigma}_W^2}{M} \right) } \]

Where: $N$ = Total number of primaries in the population.
       $M$ = Total number of secondaries per primary.

Numerical example.—Suppose that we wish to estimate a popu-
lation mean with a standard error of 10 percent or less. We have
defined the population as being composed of N = 1,000 primaries
with M = 100 secondaries per primary.




As we know nothing of the variability between or within these 
primaries nor about the costs, we take a preliminary sample
consisting of eight primaries with two secondaries per primary. 
Results are as follows: 


Data from preliminary survey

                Observed values       Primary
   Primary      of secondaries         total
      1           34       42            76
      2           36       17            53
      3           41       56            97
      4           62       40           102
      5           82       94           176
      6           16       38            54
      7           22       41            63
      8           93       50           143
                               Total =  764


From this preliminary sample, we compute

$s_B^2 = 981.8571$
$s_W^2 = 248.2500$

\[ \bar{\bar{y}} = \frac{764}{16} = 47.75 \]


Therefore, the estimates of the population variances between and
within primaries are

\[ \hat{\sigma}_W^2 = s_W^2 = 248.25 \]

\[ \hat{\sigma}_B^2 = \frac{s_B^2 - s_W^2}{m} = \frac{981.8571 - 248.2500}{2} = 366.8036 \]


Assume also that the preliminary sample yields the following cost 
estimates : 


$c_p$ = $14.00
$c_s$ = $ 1.20


Then our estimate of the optimum number of secondaries to be
observed in each primary is

\[ m_o = \sqrt{ \left( \frac{14.00}{1.20} \right) \left( \frac{248.25}{366.8036} \right) } = 2.8 \]



Since we can't observe a fraction of a unit, we must now decide
whether to take two or three secondaries per primary. To do this,
we estimate the number of primaries needed for an m of 2 and for
an m of 3, compute the cost of the two alternatives and select the
less expensive one.

Our preliminary sample gave an estimate of the mean of 47.75
and, since we have specified a standard error of 10 percent, this
means we want D = (0.10)(47.75) = 4.775 or 4.8.

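If m = 2, the number of primaries needed for the desired pre-
cision would be

\[ n = \frac{ \hat{\sigma}_B^2 + \frac{\hat{\sigma}_W^2}{m} }{ D^2 + \frac{1}{N} \left( \hat{\sigma}_B^2 + \frac{\hat{\sigma}_W^2}{M} \right) } = \frac{ 366.8036 + \frac{248.25}{2} }{ (4.8)^2 + \frac{1}{1,000} \left( 366.8036 + \frac{248.25}{100} \right) } \]

\[ = \frac{490.9286}{23.4093} = 20.97 \]

or, n = 21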


There will be 21 primaries at a cost of $14 each and 2(21) = 42 
secondaries at a cost of $1.20 each, so that the total survey cost 
(exclusive of overhead) will be $344.40. 

If m = 3, the number of primaries will be

\[ n = \frac{ 366.8036 + \frac{248.25}{3} }{ (4.8)^2 + \frac{1}{1,000} \left( 366.8036 + \frac{248.25}{100} \right) } = \frac{449.5536}{23.4093} = 19.20 \]

or, n = 20


The cost of this survey will be 
20(14.00) + 60(1.20) = 352.00 


As the first alternative gives the desired precision at a lower cost, 
we would sample n = 21 primaries and m = 2 secondaries per 
primary. 
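
The planning steps of this numerical example can be sketched in a few lines of Python. The function names `optimum_secondaries`, `primaries_needed`, and `survey_cost` are hypothetical; the formulas are the optimum-m and sample-size expressions as reconstructed from the worked figures in this section (numerators 490.9286 and 449.5536), so the sketch should be read as a check on the arithmetic, not as a general planning tool.

```python
import math


def optimum_secondaries(c_p, c_s, var_between, var_within, M):
    """Optimum number of secondaries per primary, capped at M."""
    m_o = math.sqrt((c_p / c_s) * (var_within / var_between))
    return min(m_o, M)


def primaries_needed(m, D, N, M, var_between, var_within):
    """Primaries needed for standard error D, rounded up to a whole unit."""
    numerator = var_between + var_within / m
    denominator = D ** 2 + (var_between + var_within / M) / N
    return math.ceil(numerator / denominator)


def survey_cost(n, m, c_p, c_s):
    """Field cost, exclusive of overhead: n primaries and n*m secondaries."""
    return n * c_p + n * m * c_s


var_w = 248.25                        # estimate of within-primary variance
var_b = (981.8571 - 248.25) / 2       # estimate of between-primary variance

m_o = optimum_secondaries(14.00, 1.20, var_b, var_w, M=100)
print(f"optimum m is about {m_o:.1f}")

plans = {}
for m in (2, 3):
    n = primaries_needed(m, D=4.8, N=1000, M=100,
                         var_between=var_b, var_within=var_w)
    plans[m] = (n, survey_cost(n, m, 14.00, 1.20))
    print(f"m = {m}: n = {n}, cost = ${plans[m][1]:.2f}")
```

The cheaper of the two plans is the one to adopt, which in this example is m = 2.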

Systematic arrangement of secondaries.—Though the potential
economy of two-stage sampling has been apparent and appealing
to foresters, they have displayed a reluctance to select secondary
units at random. Primary sampling points may be selected at ran- 
dom, but at each point the secondaries will often be arranged in 
а set pattern. This is not two-stage sampling in the sense that 
we have been using the term, though it may result in similar in- 
creases in sampling efficiency. It might be called cluster sampling, 
the cluster being the group of secondaries at each location. The 
unit of observation then is not the individual plot but the entire 
cluster. The unit value is the mean or total for the cluster. Estimates
and their errors are computed by the formulae that apply
to the method of selecting the cluster locations. 

Within each primary the clusters should be selected so that
every secondary has a chance of appearing in the sample. If certain
portions of the primaries are systematically excluded, bias
may result. 


Two-Stage Sampling With Unequal-Sized Primaries


The two-stage method of the previous chapter gives the same 
weight to all primaries. This hardly seems logical if the primaries 
vary greatly in size. It would, for example, give the same weight 
to a 10,000-acre tract as to a 40-acre tract. There are several modi- 
fied methods of two-stage sampling which take primary size into 
account. 

Stratified two-stage sampling.—One approach is to group equal- 
sized primaries into strata and apply the standard two-stage meth- 
ods and computations within each stratum. Population estimates 
are made by combining the individual stratum estimates according
to the stratified sampling formulae. This is a very good design
if the size of each primary is known and the number of strata is 
not too large. If the number of primaries is small, it may even be 
feasible to regard each primary as a stratum and use regular 
single-stage stratified sampling. 

Selecting primaries with probability proportional to size.—An- 
other possibility is to select primaries with probability propor- 
tional to size (pps) and secondaries within primaries with equal 
probability. Selection of primaries must be with replacement, but
secondaries can be selected without replacement. A new set of 
secondaries should be drawn each time that a given primary is 
selected so that a secondary that was selected during one sampling
may again be selected during some subsequent sampling of that 
primary. 

After the observations have been made, the sample mean ($\bar{y}_i$)
is computed for each of the n primaries included in the sample.
These primary means are then used to compute an estimate of the
population mean by

\[ \bar{y}_{pps} = \frac{\sum \bar{y}_i}{n} \]

The standard error of the mean is

\[ s_{\bar{y}_{pps}} = \sqrt{ \frac{ \sum \bar{y}_i^2 - \frac{(\sum \bar{y}_i)^2}{n} }{n(n - 1)} } \]



If only one secondary is selected in each selected primary, this 
procedure becomes identical to simple random sampling. 

If there is any relationship between the primary size and its 
mean, pps sampling may give estimates of low precision. The precision
can be improved by combining stratified two-stage sampling
and pps selection of primaries. Primaries of similar size are 
grouped into strata and within each stratum selection of primaries 
is made with probability proportional to size. Strata means and 
variance are computed by the formulae for two-stage sampling 
with pps selection of primaries. 

Selection of primaries with equal probability—The procedures 
that have been discussed so far require reasonably accurate infor- 
mation about the size of each primary in the population—infor- 
mation that is often lacking. An alternative technique requires 
knowledge only of the size of the primaries actually included in 
the sample and of the total number of primaries in the population. 
The method involves selection of n primaries and $m_i$ secondaries
within the $i$th selected primary. At each level, sampling is with
equal probability and without replacement. The number of second-
aries sampled ($m_i$) may vary or remain constant. The sample
primary mean ($\bar{y}_i$) is computed for each selected primary and
from these the population mean is estimated as

\[ \bar{\bar{y}} = \frac{\sum (M_i \bar{y}_i)}{\sum M_i} \]

where: $n$ = Number of primaries sampled.
       $\bar{y}_i$ = Mean per secondary in the $i$th sampled primary.
       $M_i$ = Total number of secondary units in the $i$th sampled
             primary (this can be an actual or a relative measure
             of size).



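The standard error of this estimate is

\[ s_{\bar{\bar{y}}} = \bar{\bar{y}} \sqrt{ \frac{n}{n - 1} \left( \frac{\sum M_i^2}{(\sum M_i)^2} + \frac{\sum T_i^2}{(\sum T_i)^2} - \frac{2 \sum M_i T_i}{(\sum M_i)(\sum T_i)} \right) \left( 1 - \frac{n}{N} \right) } \]

where: $n$ = Number of primaries sampled.
       $N$ = Total number of primaries.
       $T_i$ = $M_i \bar{y}_i$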




For an illustration of the computations, suppose that we wished 
to estimate the mean board-foot volume on a population of 426 
woodlots. Four woodlots (primaries) are selected at random, and 
within each woodlot the board-foot volume is measured on two 
randomly selected one-fifth-acre plots. For each woodlot selected, 
the acreage is also determined. Since one-fifth-acre plots were 
used, the value of $M_i$ for the $i$th woodlot will be 5 times its acreage.
Assume the observed values are as follows: 


                               Primary
  Sampled                       means      Woodlot
  woodlot     Plot values    ($\bar{y}_i$)    acreage     $M_i$     $M_i \bar{y}_i = T_i$
               (Bd. ft.)      (Bd. ft.)
     1        620    740         680         110        550        374,000
     2        585    475         530          26        130         68,900
     3        590    730         660          54        270        178,200
     4        960    820         890          60        300        267,000
                                                      -----        -------
                                                      1,250        888,100


\[ \bar{\bar{y}} = \frac{\sum (M_i \bar{y}_i)}{\sum M_i} = \frac{888,100}{1,250} = 710.48 \] board feet per fifth-acre plot.


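The values needed to compute the standard error are

$\sum M_i^2 = 482,300$            $\sum T_i^2 = 247,667,450,000$
$\sum M_i T_i = 342,871,000$
$(\sum M_i)^2 = 1,562,500$        $(\sum T_i)^2 = 788,721,610,000$
$(\sum M_i)(\sum T_i) = 1,110,125,000$

Hence, with the fpc ignored,

\[ s_{\bar{\bar{y}}} = 710.48 \sqrt{ \frac{4}{3} \left( \frac{482,300}{1,562,500} + \frac{247,667,450,000}{788,721,610,000} - \frac{2(342,871,000)}{1,110,125,000} \right) } \]

\[ = 710.48 \sqrt{0.00662295} \]

= 57.82 board feet.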
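
For readers who prefer to check such figures by machine, here is a minimal Python sketch of the estimator just illustrated. The function name `ratio_estimate_unequal_primaries` and the optional `N` argument are assumptions made for this illustration; leaving `N` unset ignores the finite-population correction, as the worked example does.

```python
import math


def ratio_estimate_unequal_primaries(primary_means, sizes, N=None):
    """Weighted mean per secondary and its standard error.

    primary_means: sample mean per secondary in each sampled primary.
    sizes: number of secondaries (M_i) in each sampled primary.
    N: total primaries in the population; None ignores the fpc.
    """
    n = len(sizes)
    totals = [M_i * y_i for M_i, y_i in zip(sizes, primary_means)]  # T_i
    sum_M = sum(sizes)
    sum_T = sum(totals)
    mean = sum_T / sum_M

    relative_variance = (n / (n - 1)) * (
        sum(M_i ** 2 for M_i in sizes) / sum_M ** 2
        + sum(T_i ** 2 for T_i in totals) / sum_T ** 2
        - 2 * sum(M_i * T_i for M_i, T_i in zip(sizes, totals)) / (sum_M * sum_T)
    )
    if N is not None:
        relative_variance *= 1 - n / N
    return mean, mean * math.sqrt(relative_variance)


plot_values = [(620, 740), (585, 475), (590, 730), (960, 820)]
acreages = [110, 26, 54, 60]

primary_means = [sum(v) / len(v) for v in plot_values]
sizes = [5 * acres for acres in acreages]     # fifth-acre plots per woodlot

ratio_mean, ratio_se = ratio_estimate_unequal_primaries(primary_means, sizes)
print(f"{ratio_mean:.2f} +/- {ratio_se:.2f} board feet per fifth-acre plot")
```

Passing `N=426` applies the correction; with only 4 of 426 woodlots sampled it changes the standard error very little.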


This estimate of the mean will be slightly biased if there is any
relationship between the primary size and the mean per unit in 
that primary. The bias is generally not serious for large samples 
(more than 30 primaries). 

An unbiased equal-probability estimator.—If the bias incurred
by use of the above estimator is expected to be large, an unbiased
estimate can be obtained. In addition to the information required
for the biased procedure, we must also know the total number of
secondaries (M) in the population. As in the case of the biased
estimator, n primaries are selected with equal probability and
within each primary $m_i$ secondaries are observed. The mean per
unit ($\bar{y}_i$) is computed for each primary and used to estimate the
population mean

\[ \bar{\bar{y}} = \frac{N}{nM} \sum M_i \bar{y}_i \]

The standard error of the mean is

\[ s_{\bar{\bar{y}}} = \frac{N}{M} \sqrt{ \frac{ \sum (M_i \bar{y}_i)^2 - \frac{(\sum M_i \bar{y}_i)^2}{n} }{n(n - 1)} \left( 1 - \frac{n}{N} \right) } \]


Now, assume that the 426 woodlots of the previous example 
have a total area of 26,412 acres. Then, because the secondary 
units are one-fifth acre in size, the total number of secondaries in 
the population is M = 132,060. With the same sample data the 
unbiased estimate of the population mean per unit would be 





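\[ \bar{\bar{y}} = \frac{426}{4(132,060)} (888,100) = 716.21 \] board feet per plot.


The standard error, with the fpc ignored, is

\[ s_{\bar{\bar{y}}} = \frac{426}{132,060} \sqrt{ \frac{ (374,000^2 + \ldots + 267,000^2) - \frac{(888,100)^2}{4} }{4(3)} } \]

\[ = 0.003226 \sqrt{4,207,253,958} \]

= 209.25 board feet.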
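
The unbiased estimator can be checked the same way. In the sketch below the function name `unbiased_estimate` and the `use_fpc` switch are hypothetical, and the finite-population correction is left off by default to match the worked example. Because the script carries N/M at full precision instead of the rounded 0.003226, its standard error comes out at about 209.24 board feet.

```python
import math


def unbiased_estimate(primary_means, sizes, N, M_total, use_fpc=False):
    """Unbiased mean per secondary when primaries are drawn with equal probability.

    N: primaries in the population; M_total: secondaries in the population.
    """
    n = len(sizes)
    totals = [M_i * y_i for M_i, y_i in zip(sizes, primary_means)]  # T_i
    sum_T = sum(totals)
    mean = N / (n * M_total) * sum_T

    variance_of_totals = (
        sum(T_i ** 2 for T_i in totals) - sum_T ** 2 / n
    ) / (n * (n - 1))
    if use_fpc:
        variance_of_totals *= 1 - n / N
    return mean, (N / M_total) * math.sqrt(variance_of_totals)


woodlot_means = [680, 530, 660, 890]       # board feet per fifth-acre plot
woodlot_sizes = [550, 130, 270, 300]       # fifth-acre plots per woodlot

unbiased_mean, unbiased_se = unbiased_estimate(
    woodlot_means, woodlot_sizes, N=426, M_total=132_060
)
print(f"{unbiased_mean:.2f} +/- {unbiased_se:.2f} board feet per plot")
```

Setting this result beside the ratio-type estimate makes the trade between bias and precision discussed in the text easy to see.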


The standard error of the unbiased estimate (209.25) as com- 
pared to that of the biased estimate (57.82) shows why the latter 
is often preferred. But, if the size of all primaries is known, the 
bias of the biased estimator can be reduced and the precision of 
the unbiased estimator increased by grouping similar sized pri- 
maries and using these estimating procedures in conjunction with 
stratified sampling. 


Systematic Sampling 


As the name implies, and as most foresters know, the units in-
cluded in a systematic sample are selected not at random but
according to a pre-specified pattern. Usually, the only element of
randomization is in the selection of the starting point of the pat-
tern, and even that is often ignored. The most common pattern
is a grid having the sample units in equally spaced rows with a
constant distance between units within rows.




To the disdain of some statisticians, the vast majority of forest 
surveys have been made by some form of systematic sampling. 
There are two reasons: (1) the location of sample units in the 
field is often easier and cheaper, and (2) there is a feeling that 
a sample deliberately spread over the entire population will be 
more representative than a random sample.

Statisticians usually will not argue against the first reason. 
They are less willing to accept the second. They admit the possibility,
sometimes even the probability, that a systematic sample
will give a more precise estimate of the true population mean 
(i.e., be more representative) than would a random sample of the 
same size. They point out, however, that estimation of the samp- 
ling error of a systematic survey requires more knowledge about 
the population than is usually available, with the result that the 
sampler can seldom be sure just how precise his estimate is. The 
common procedure is to use random sampling formulae to com- 
pute the errors of a systematic survey. Depending on the degree 
and the way in which the population falls into patterns, the pre- 
cision may be either much lower or much higher than that sug- 
gested by the random formulae. If there is no definite pattern in 
the unit values in the population, the random formulae may give 
a fair indication of the sampling precision. The difficulty is in
knowing which condition applies to a particular sample. 

The well-known procedure of superimposing two or more sys- 
tematic grids, each with randomly located starting points, does 
provide some of the advantages of systematic sampling along with 
a valid estimate of the sampling error. In this procedure each grid 
becomes, in effect, a single observation and the error is estimated 
from the variability among grids. Locating plots in the field be- 
comes more difficult as the number of grids increases, however, 
and it would seem as though the advantage of representativeness 
could be obtained more easily and efficiently by stratified sampling 
with small blocks serving as strata. 

Despite the known hazards, foresters are not likely to give up 
systematic sampling. They will usually take the precaution of 
running the lines of plots at right angles rather than parallel to 
ridges and streams. In most cases, sampling errors will be com- 
puted by formulae appropriate to random sampling. Experience 
suggests that a few of these surveys will be very misleading, but 
that most of them will give estimates having precision as good 
as or slightly better than that shown by the random sampling 
formulae. Some statisticians will continue to bemoan the practice
and a few of them will keep searching for a workable general solu- 
tion to the problem of error estimates (though at least one very 
eminent statistician doubts that a workable solution exists). 


SAMPLING METHODS FOR DISCRETE VARIABLES 


Simple Random Sampling—Classification Data 


Assume that from a large batch of seed 50 have been selected
at random in order to estimate the proportion (p) that are sound.
Assume also that cutting or hammering discloses that 39 of the 50
seeds were sound. Then our estimate ($\bar{p}$) of the proportion that
is sound is

$$\bar{p} = \frac{\text{Number having the specified attribute}}{\text{Number observed}} = \frac{39}{50} = 0.78$$


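Standard error of estimate.—The estimated standard error of
$\bar{p}$ is

$$s_{\bar{p}} = \sqrt{\frac{\bar{p}(1 - \bar{p})}{n - 1}\left(1 - \frac{n}{N}\right)}$$

where: n = number of units observed.
In this example N is extremely large relative to n, and so the
finite-population correction could be ignored

$$s_{\bar{p}} = \sqrt{\frac{(0.78)(0.22)}{49}} = 0.05918$$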


Confidence limits.—For certain sample sizes (among them, n
= 50), confidence limits can be obtained from table 3, page 87.
In this example we found that in a sample of n = 50 seeds, 39
were sound. The estimated proportion sound was 0.78 and, as
shown in table 3, the 95-percent confidence limits would be 0.64
and 0.88. For samples of 100 and larger the table does not show
the confidence limits for proportions higher than 0.50. These can
easily be obtained, however, by working with the proportion of
units not having the specified attribute. Thus suppose that, in a
sample of n = 1,000 seeds, 78 percent were sound. This is equivalent
to saying that 22 percent were not sound, and the table shows
that for n = 1,000 the 95-percent confidence interval for an observed
fraction of 0.22 is 0.19 to 0.25. If the true population proportion
of unsound seed is within the limits 0.19 and 0.25, the
population proportion of sound seed must be within the limits 0.75
and 0.81.

Confidence intervals for large samples.—For large samples, the 
95-percent confidence interval can be computed as 


$$\bar{p} \pm \left[2 s_{\bar{p}} + \frac{1}{2n}\right]$$


Assume that a sample of n = 250 units has been selected and that 
70 of these units are found to have some specified attribute. Then, 


$$\bar{p} = \frac{70}{250} = 0.28$$

And,

$$s_{\bar{p}} = \sqrt{\frac{(0.28)(0.72)}{249}} = 0.02845 \quad \text{(ignoring the finite-population correction)}$$

Then, the 95-percent confidence interval

$$= 0.28 \pm \left[2(0.02845) + \frac{1}{2(250)}\right]$$

$$= 0.28 \pm 0.059$$

$$= 0.221 \text{ to } 0.339$$


Thus, unless a 1-in-20 chance has occurred, the true proportion
is somewhere within the limits 0.22 and 0.34. For a 99-percent
confidence interval we would multiply $s_{\bar{p}}$ by 2.6 instead of 2. (For
samples of n = 250 or 1,000, the confidence interval could, of 
course, be obtained from table 3. For this example the table gives 
0.22 to 0.34 as the limits.) 
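
The standard-error and large-sample interval computations for a proportion can be sketched in Python. This is only an illustration and not part of the handbook: the function names `proportion_se` and `normal_approx_limits` are invented for the example, and the multiplier 2 is the handbook's rounded value for 95-percent limits.

```python
import math

def proportion_se(successes, n, N=None):
    """Return the estimated proportion and its standard error.

    The finite-population correction (1 - n/N) is applied only when N is given.
    """
    p = successes / n
    variance = p * (1 - p) / (n - 1)
    if N is not None:
        variance *= 1 - n / N
    return p, math.sqrt(variance)

def normal_approx_limits(successes, n, multiplier=2.0, N=None):
    """Large-sample limits: p +/- [multiplier * s_p + 1/(2n)]."""
    p, se = proportion_se(successes, n, N)
    half_width = multiplier * se + 1 / (2 * n)
    return p - half_width, p + half_width

# Seed example: 39 sound seeds out of 50.
p_seed, se_seed = proportion_se(39, 50)

# Large-sample example: 70 of 250 units have the attribute.
p_large, se_large = proportion_se(70, 250)
low, high = normal_approx_limits(70, 250)

print(f"seed example:  p = {p_seed:.2f}, s_p = {se_seed:.5f}")
print(f"large sample:  p = {p_large:.2f}, s_p = {se_large:.5f}")
print(f"95-percent limits: {low:.3f} to {high:.3f}")
```

Passing `multiplier=2.6` gives the 99-percent limits described in the text.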

The above equation gives what is known as the normal approxi- 
mation to the confidence limits. As noted, it can be used for large 
samples. What qualifies as a large sample depends on the propor- 
tion of items having the specified characteristic. As a rough guide, 
the normal approximation will be good if the common logarithm 
of the sample size (n) is equal to or greater than 


$$1.5 + 3\,|P - 0.5|$$


where: P = our best estimate of the true proportion of the population
having the specified attribute.


|P − 0.5| = the absolute value (i.e., algebraic sign ignored) of
the departure of P from 0.5.


Thus, if our estimate of P is 0.20 then |P − 0.5| is equal to 0.3
and, if we are to use the normal approximation, the log of our
sample size should be greater than


$$1.5 + 3(0.3) = 2.4$$


Or n must be over 251 (2.4 = log 251).
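
The rough guide for when the normal approximation applies is easy to check numerically. The sketch below assumes nothing beyond the rule as stated; the function names are hypothetical.

```python
import math

def min_n_for_normal_approx(P):
    """Smallest sample size allowed by the rule log10(n) >= 1.5 + 3|P - 0.5|."""
    return 10 ** (1.5 + 3 * abs(P - 0.5))

def normal_approx_ok(n, P):
    """True if a sample of size n satisfies the rough guide for proportion P."""
    return math.log10(n) >= 1.5 + 3 * abs(P - 0.5)

print(min_n_for_normal_approx(0.20))   # a little over 251
print(normal_approx_ok(250, 0.28))     # the n = 250 example
print(normal_approx_ok(50, 0.78))      # the 50-seed example
```

By this guide the n = 250 example qualifies as a large sample, while the 50-seed example does not, which is consistent with the text taking the limits for that example from table 3.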

Sample size.—Table 3 may also be used as a guide to the number 
of units that should be observed in a simple random sample to 
estimate a proportion with a specified precision. Suppose that we 
are sampling a population in which about 40 percent of the units 
have a certain attribute and we wish to estimate this proportion 
to within ± 0.15 (at the 95-percent level). The table shows that
for a sample of size 30 having $\bar{p}$ = 0.4 the confidence limits would
be 0.23 and 0.60. Since the upper limit is not within 0.15 of
$\bar{p}$ = 0.4, a sample of size 30 would not give the necessary precision.
A sample of n = 50 would give limits of 0.27 and 0.55. As
each of these is within 0.15 of $\bar{p}$ = 0.4, we conclude that a sample
of size 50 would be adequate.







If the table suggests that a sample of over 100 will be needed,
the size can be estimated by

$$n = \frac{1}{\dfrac{E^2}{(4)(P)(1-P)} + \dfrac{1}{N}} \quad \text{for 95-percent confidence}$$

$$n = \frac{1}{\dfrac{E^2}{(6.76)(P)(1-P)} + \dfrac{1}{N}} \quad \text{for 99-percent confidence}$$

where: E = The precision with which P is to be estimated,
N = Total number of units in the population.

The table indicates that to estimate a P of about 0.4 to within
E = ± 0.05 (at the 95-percent confidence level) would require
somewhere between 250 and 1,000 observations. Using the first of
the above formulae (and assuming N = 5,000) we would find,

$$n = \frac{1}{\dfrac{(0.05)^2}{(4)(0.4)(0.6)} + \dfrac{1}{5,000}} = 357$$

If we have no idea of the value of P, we will have to make a
guess at it in order to estimate the sample size. The safest course
is to guess a P as close to 0.5 as it might reasonably occur, since
P(1 − P), and with it the required sample size, is largest at P = 0.5.

How to select a seed at random.—If we were trying to estimate
the proportion of trees in a stand having a certain disease, it
would be difficult to select the individual trees at random and then
locate them in the field for observation. In some populations, however,
the individuals themselves are randomly located or can easily
be made so. A batch of seed is such a population. By thoroughly
mixing the seed prior to sampling, it is possible to select a number
of individuals from one position in the batch and assume that
this is equivalent to a completely random sample. Those who have
sampled seed warn against mixing in such a manner that the light
empty seeds tend to work together towards the top of the pile.
The sample could be taken with a small scoop or a seed probe
which picks up approximately the number of seed to be examined.
As a precaution, most seed samplers will use a scoop that selects
only a fraction of the desired number of seeds and will take
samples from several places in the pile and combine them.
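
The sample-size formulae for a proportion given earlier in this section can be tried out with a short Python sketch. It is an illustration only; `sample_size_for_proportion` is a hypothetical name, and the multipliers 4 and 6.76 are the handbook's squared values of 2 and 2.6.

```python
import math

def sample_size_for_proportion(P, E, N, squared_multiplier=4.0):
    """n = 1 / (E^2 / (k * P * (1 - P)) + 1/N).

    k = 4 for 95-percent confidence, 6.76 for 99-percent confidence.
    """
    return 1.0 / (E ** 2 / (squared_multiplier * P * (1 - P)) + 1.0 / N)

n_95 = sample_size_for_proportion(P=0.4, E=0.05, N=5000)
n_99 = sample_size_for_proportion(P=0.4, E=0.05, N=5000, squared_multiplier=6.76)
n_worst = sample_size_for_proportion(P=0.5, E=0.05, N=5000)

print(math.ceil(n_95))     # 95-percent confidence, P = 0.4
print(math.ceil(n_99))     # 99-percent confidence, P = 0.4
print(math.ceil(n_worst))  # 95-percent confidence with the guess P = 0.5
```

Guessing P = 0.5 yields the largest n of any guess, which is why a guess near 0.5 is the cautious choice when P is unknown.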








Cluster Sampling for Attributes 


In attribute sampling the cost of selecting and locating an in- 
dividual is usually very high relative to the cost of determining 
whether or not the individual has a certain characteristic. Because 
of this, some form of cluster sampling is usually preferred over
simple random sampling. In cluster sampling, a group of individ- 
uals becomes the unit of observation, and the unit value is the 
proportion of the individuals in the group having the specified 
attribute. 

In estimating the survival percent of a plantation it would be
possible to choose individual trees for observation by randomly
selecting pairs of numbers and letting the first number stand for
a row and the second number designate the tree within that row.
But it would obviously be inefficient to ignore all of the trees that
must be passed to get to the one selected. Instead, we would probably
make survival counts in a number of randomly selected rows
and (assuming the same number of trees were planted in each
row) average these to estimate the survival percent. This is a
form of cluster sampling, the cluster being a row of planted trees.

The germination percent of a batch of seed might also be estimated
by cluster sampling. Here the advantage of clusters comes
not in the selection of individuals for observation but from avoiding
some hazards of germination tests. Such tests are commonly
made in small covered dishes. If all the seeds are in a single dish,
any mishaps (e.g. overwatering or fungus attack) may affect
the entire test. To avoid this hazard, it is common to place a fixed
number of seeds (one or two hundred) in each of several dishes. 
The individual dish then becomes the unit of observation and the 
unit value is the germination percent for the dish. 

When clusters are fairly large and all of the same size, the pro- 
cedures for computing estimates of means and standard errors 
are much the same as those described for measurement data. To 
illustrate, assume that 8 samples of 100 seeds each have been 
selected from a thoroughly mixed batch. The 100-seed samples are 
placed in 8 separate germination dishes. After 30 days, the follow- 
ing germination percentages are recorded: 


Dish No.             |  1   2   3   4   5   6   7   8 | Total
Germination (pct.)   | 84  88  86  76  81  80  85  84 |   664





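If $p_i$ is the germination percent in the $i$th dish, the mean germination
percent would be estimated by

$$\bar{p} = \frac{\sum p_i}{n} = \frac{664}{8} = 83.0$$

The variance of p would be computed by

$$s_p^2 = \frac{\sum p_i^2 - \dfrac{(\sum p_i)^2}{n}}{n - 1} = \frac{(84^2 + 88^2 + \ldots + 84^2) - \dfrac{(664)^2}{8}}{7} = 14.5714$$

Whence the standard error of $\bar{p}$ can be obtained as

$$s_{\bar{p}} = \sqrt{\frac{s_p^2}{n}\left(1 - \frac{n}{N}\right)} = \sqrt{\frac{14.5714}{8}} = 1.35 \quad \text{(ignoring the finite-population correction)}$$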





Note that, in cluster sampling, n stands for the number of clusters
sampled and N is the number of possible clusters in the population.

As in simple random sampling of measurement data, a confidence
interval for the estimated percentage can be computed by
Student's t

$$\text{95-percent confidence interval} = \bar{p} \pm t\,(s_{\bar{p}})$$

Where: t = Value of Student's t at the 0.05 level with n − 1
degrees of freedom. Thus, in this example, t would have 7 degrees
of freedom and $t_{.05}$ would be 2.365. The 95-percent confidence
interval would be

$$83.0 \pm (2.365)(1.35) = 83.0 \pm 3.19 = 79.8 \text{ to } 86.2$$
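
The equal-sized-cluster computation for the eight germination dishes can be reproduced with a few lines of Python. This is a sketch for checking the arithmetic; the variable names are arbitrary, and the value 2.365 is simply the tabulated Student's t quoted in the text, entered by hand.

```python
import math
import statistics

# Germination percent observed in each of the 8 dishes (100 seeds per dish).
germination = [84, 88, 86, 76, 81, 80, 85, 84]

n = len(germination)
mean_pct = statistics.mean(germination)
var_pct = statistics.variance(germination)   # divisor n - 1
se_mean = math.sqrt(var_pct / n)             # fpc ignored

t_05 = 2.365                                 # Student's t, 7 degrees of freedom
ci_low = mean_pct - t_05 * se_mean
ci_high = mean_pct + t_05 * se_mean

print(f"mean = {mean_pct:.1f}, variance = {var_pct:.4f}, s.e. = {se_mean:.2f}")
print(f"95-percent interval: {ci_low:.1f} to {ci_high:.1f}")
```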


Transformation of percentages.—If clusters are small (less
than 100 units per cluster) or if some of the observed percentages
are greater than 80 or less than 20, it may be desirable to
transform the percentages before computing means and confidence
intervals. The common transformation is $\arcsin\sqrt{\text{percent}}$. Table
4, page 89, makes it easy to transform the observed percentages.
For the data in the previous example, the transformed values
would be

Dish No.     Percent     Arcsin
   1           84         66.4
   2           88         69.7
   3           86         68.0
   4           76         60.7
   5           81         64.2
   6           80         63.4
   7           85         67.2
   8           84         66.4
                         -----
 Total                   526.0

The mean of the transformed values is

$$\frac{526.0}{8} = 65.75$$

The variance of these values is

$$s^2 = \frac{(66.4^2 + \ldots + 66.4^2) - \dfrac{(526)^2}{8}}{7} = 8.1486$$

And the standard error of the mean transformed value is

$$s_{\bar{y}} = \sqrt{\frac{8.1486}{8}} = \sqrt{1.0186} = 1.009$$

So the 95-percent confidence limits would be (using $t_{.05}$ for 7 df's
= 2.365)

$$CI = 65.75 \pm (2.365)(1.009) = 65.75 \pm 2.39 = 63.36 \text{ to } 68.14$$




Referring to the table again, we see that the mean of 65.75 
corresponds to a percentage of 83.1. The confidence limits cor- 
respond to percentages of 79.9 and 86.1. In this case the trans- 
formation made little difference in the mean or the confidence 
limits, but in general it is safer to use the transformed values 
even though some extra work is involved. 
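
The arcsine transformation and the conversion back to percentages can also be done without table 4. The sketch below is an illustration under one assumption worth stating: it computes the angles directly with `math.asin` instead of reading rounded values from the table, so its results differ from the worked example in the second decimal place. The function names are hypothetical.

```python
import math
import statistics

def arcsin_transform(percent):
    """Angle in degrees whose sine is the square root of the proportion."""
    return math.degrees(math.asin(math.sqrt(percent / 100.0)))

def back_transform(angle_deg):
    """Convert a transformed value (degrees) back to a percentage."""
    return 100.0 * math.sin(math.radians(angle_deg)) ** 2

germination = [84, 88, 86, 76, 81, 80, 85, 84]
angles = [arcsin_transform(p) for p in germination]

n = len(angles)
mean_angle = statistics.mean(angles)
se_angle = math.sqrt(statistics.variance(angles) / n)

t_05 = 2.365  # tabulated Student's t for 7 degrees of freedom
low_angle = mean_angle - t_05 * se_angle
high_angle = mean_angle + t_05 * se_angle

print(f"mean angle = {mean_angle:.2f}, s.e. = {se_angle:.3f}")
print(f"mean percent = {back_transform(mean_angle):.1f}")
print(f"limits = {back_transform(low_angle):.1f} to {back_transform(high_angle):.1f}")
```

Note that the limits are computed on the transformed scale and only then converted back, so the interval in percent need not be symmetric about the mean.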

Other cluster-sampling designs.—If we regard the observed or 
transformed percentages as equivalent to measurements, it is easy 
to see that any of the designs described for continuous variables 
can also be used for cluster sampling of attributes. In place of 
individuals, the clusters become the units of which the population 
is composed. 

Stratified random sampling might be applied when we wish to 
estimate the mean germination percent of a seed lot made up of
seed from several sources. The sources become the strata, each 
of which is sampled by two or more randomly selected clusters 
of 100 or 200 seeds. 

With seed stored in a number of canisters of 100 pounds each, 
we might use two-stage sampling, the canisters being primary 
sampling units and clusters of 100 seeds being secondaries. If the 
canisters differed in volume, we might sample canisters with 
probability proportional to size. 


Cluster Sampling for Attributes—Unequal-Sized Clusters 

Frequently when sampling for attributes, we find it convenient 
to let a plot be the sampling unit. On each plot we will count the 
total number of individuals and the number having the specified 
attributes. Even though the plots are of equal area, the total num- 
ber of individuals may vary from plot to plot; thus, the clusters 
will be of unequal size. In estimating the proportion of individuals 
having the attribute, we probably would not want to average the 
proportions for all plots because that would give the same weight 
to plots with few individuals as to those with many. 

In such situations, we might use the ratio-of-means estimator. 
Suppose that 2,4,5-T has been sprayed on an area of small scrub
oaks and we wish to determine the percentage of trees killed. To
make this estimate, the total number of trees ($x_i$) and the number
of dead trees ($y_i$) are determined on 20 one-tenth-acre plots.





Plot   No. of      No. of dead  |  Plot     No. of      No. of dead
       trees (x)   trees (y)    |           trees (x)   trees (y)
  1       15          11        |   13         26          16
  2       42          32        |   14        160         126
  3      128          98        |   15        103          80
  4       86          42        |   16         80          58
  5       97          62        |   17         32          25
  6        8           6        |   18         56          44
  7       28          22        |   19         49          24
  8       65          51        |   20         84          59
  9       71          48        |
 10      110          66        |   Total   1,351         960
 11       63          58        |
 12       48          32        |   Mean    67.55        48.0





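The ratio-of-means estimate of the proportion of trees killed is

$$\bar{p} = \frac{\bar{y}}{\bar{x}} = \frac{48.0}{67.55} = 0.7106$$

The estimated standard error of $\bar{p}$ is

$$s_{\bar{p}} = \sqrt{\frac{1}{\bar{x}^2}\left(\frac{s_y^2 + \bar{p}^2 s_x^2 - 2\bar{p}\,s_{xy}}{n}\right)\left(1 - \frac{n}{N}\right)}$$

Where: $s_y^2$ = Variance of individual y values.
       $s_x^2$ = Variance of individual x values.
       $s_{xy}$ = Covariance of y and x.
       n = Number of plots observed.

In this example

$$s_y^2 = \frac{(11^2 + 32^2 + \ldots + 59^2) - \dfrac{(960)^2}{20}}{19} = 892.6316$$

$$s_x^2 = \frac{(15^2 + 42^2 + \ldots + 84^2) - \dfrac{(1,351)^2}{20}}{19} = 1,542.4711$$

$$s_{xy} = \frac{(11)(15) + (32)(42) + \ldots + (59)(84) - \dfrac{(960)(1,351)}{20}}{19} = 1,132.6316$$

With these values (but ignoring the fpc),

$$s_{\bar{p}} = \sqrt{\frac{1}{(67.55)^2}\left[\frac{892.6316 + (0.7106)^2(1,542.4711) - 2(0.7106)(1,132.6316)}{20}\right]} = 0.026$$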





As in any use of the ratio-of-means estimator, the results may 
be biased if the proportion of units in a cluster having a specified 
attribute is related to the size of the cluster. For large samples, 
the bias will often be trivial. 
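
The ratio-of-means estimate for the scrub-oak plots and its standard error can be verified with the following Python sketch. It is illustrative only: the variable names are arbitrary, the finite-population correction is ignored as in the worked example, and the covariance is computed by hand so that the code does not depend on any particular Python version.

```python
import math
import statistics

# Total trees (x) and dead trees (y) on the 20 one-tenth-acre plots.
trees = [15, 42, 128, 86, 97, 8, 28, 65, 71, 110,
         63, 48, 26, 160, 103, 80, 32, 56, 49, 84]
dead = [11, 32, 98, 42, 62, 6, 22, 51, 48, 66,
        58, 32, 16, 126, 80, 58, 25, 44, 24, 59]

n = len(trees)
x_bar = statistics.mean(trees)
y_bar = statistics.mean(dead)

# Ratio-of-means estimate of the proportion killed.
p_hat = y_bar / x_bar

s_x2 = statistics.variance(trees)
s_y2 = statistics.variance(dead)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(trees, dead)) / (n - 1)

# Standard error of the ratio, fpc ignored.
se_p = math.sqrt((s_y2 + p_hat ** 2 * s_x2 - 2 * p_hat * s_xy) / n) / x_bar

print(f"p = {p_hat:.4f}")
print(f"s_y^2 = {s_y2:.4f}, s_x^2 = {s_x2:.4f}, s_xy = {s_xy:.4f}")
print(f"standard error = {se_p:.4f}")
```

Averaging the 20 plot proportions instead would weight the 8-tree plot as heavily as the 160-tree plot, which is the reason given in the text for preferring the ratio of means.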


Sampling of Count Variables 


Statistical complications often arise in handling data such as
number of weevils in a cone, number of seedlings on a one-tenth-milacre
plot, and similar count variables having no fixed upper
limit. Small counts and those with numerous zeroes are especially
troublesome. They tend to follow distributions (Poisson, negative
binomial, etc.) that are difficult to work with. If count variables
cannot be avoided, the amateur sampler's best course may be to
define the sample units so that most of the counts are large and
to take samples of 30 units or more. It may then be possible to
apply the procedures given for continuous variables.

In order to estimate the number of larvae of a certain insect
in the litter of a forest tract, one-foot-square litter samples were
taken at 600 randomly selected points. The litter was carefully
examined and the number of larvae recorded for each sample. The
counts varied from 0 to 6 larvae per plot. The number of plots on
which the various counts were observed were


Count             =    0    1    2    3    4    5    6    Total
Number of plots   =  256  224   92   21    4    1    2      600


The counts are very close to a Poisson distribution (see page
6). To permit the application of normal distribution methods,
the units were redefined. The new units were to consist of 15 of
the original units selected at random from the 600. There were
to be a total of 40 of the new units, and unit values were to be
the total larvae count for the 15 selected observations. The values
for the 40 redefined units were


14 18 16 18 13 14 15 12 


16 18 11 7 9 10 11 10
12 14 13 14 14 18 9 17 
15 8 12 5 13 15 18 10 
12 12 20 10 9 14 15 18 

Total = 504 


By the procedures for simple random sampling of a continuous
variable, the estimated mean ($\bar{y}$) per unit is

$$\bar{y} = \frac{504}{40} = 12.6$$

The variance ($s_y^2$) is

$$s_y^2 = \frac{\sum y^2 - \dfrac{(504)^2}{40}}{39} = 8.8615$$

With correction for finite population ignored, the standard error
of the mean ($s_{\bar{y}}$) is

$$s_{\bar{y}} = \sqrt{\frac{8.8615}{40}} = 0.47$$

The new units have a total area of 15 square feet; hence to
estimate the mean number of larvae per acre the mean per unit
must be multiplied by

$$\frac{43,560}{15} = 2,904$$

Thus, the mean per acre is
(2,904) (12.6) = 36,590.4
The standard error of the mean per acre is
(2,904) (0.47) = 1,364.88
As an approximation we can say that unless a 1-in-20 chance
has occurred in sampling, the mean count per acre is within the
limits
36,590.4 ± 2(1,364.88)
or,
33,860 to 39,320
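
Two parts of the larvae example lend themselves to a quick check in Python: the claim that the original counts look like a Poisson distribution (for which the mean and variance are equal), and the conversion of the per-unit results to a per-acre basis. The sketch below is illustrative; it starts from the frequency table and from the summary values quoted in the text (variance 8.8615 for the 40 redefined units) and carries the standard error unrounded, so the per-acre standard error differs slightly from the handbook's figure, which uses the rounded value 0.47.

```python
import math

# Number of one-foot-square plots on which each larvae count was observed.
plots_by_count = {0: 256, 1: 224, 2: 92, 3: 21, 4: 4, 5: 1, 6: 2}

n_plots = sum(plots_by_count.values())
total_larvae = sum(count * plots for count, plots in plots_by_count.items())
sum_squares = sum(count ** 2 * plots for count, plots in plots_by_count.items())

mean_count = total_larvae / n_plots
var_count = (sum_squares - total_larvae ** 2 / n_plots) / (n_plots - 1)
print(f"mean = {mean_count:.3f}, variance = {var_count:.3f}")  # nearly equal

# Redefined units: 40 units, each the total of 15 one-square-foot plots.
n_units = 40
mean_per_unit = total_larvae / n_units
var_per_unit = 8.8615                      # variance quoted in the text
se_per_unit = math.sqrt(var_per_unit / n_units)

per_acre_factor = 43560 / 15               # square feet per acre / unit area
mean_per_acre = per_acre_factor * mean_per_unit
se_per_acre = per_acre_factor * se_per_unit

print(f"mean per acre = {mean_per_acre:.1f}")
print(f"s.e. per acre = {se_per_acre:.1f}")
print(f"approximate limits: {mean_per_acre - 2 * se_per_acre:.0f} "
      f"to {mean_per_acre + 2 * se_per_acre:.0f}")
```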


SOME OTHER ASPECTS OF SAMPLING 


Size and Shape of Sampling Units 

The size and shape of the sampling unit may profoundly affect 
the cost of the survey, its precision, or both. No attempt will be 
made here to offer an exhaustive study, but an example may illus- 
trate the problem and a general approach to its solution. 

Consider a preharvest inventory in a nursery containing 1,000 
beds of slash pine, each bed 500 feet long and 4 feet wide. Con- 
ventional practice in this nursery has been to sample the beds by 
observing the total number of plantable seedlings in a 1- by 4-foot 
sampling frame laid crosswise at five randomly chosen locations 
in each bed. The process is laborious and time consuming, totaling 
5,000 observations, or nearly a mile of bed. The nurseryman would 
like to know if a frame 6 inches wide would be better than the 
conventional 12-inch frame. 

One practical way to judge among sampling units is to compare 
the total cost of surveys made with each unit, with the restriction 
that both methods shall afford equal precision. For example,¹ if
the cost per observation with the 6-inch frame is $d_1$, then for $n_1$
observations the cost of the survey (exclusive of overhead costs,
which are assumed to be the same for both size units) is

$$c_1 = n_1 d_1$$

Similarly, for the 12-inch frame, we can say

$$c_2 = n_2 d_2$$

¹ For illustrative purposes the nursery survey will be treated as a simple
random sample, though the specification of a set number of plots in each bed
makes it a stratified design.




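Then the cost of the 6-inch frame relative to the cost of the 12-inch
frame is

$$\frac{c_1}{c_2} = \frac{n_1 d_1}{n_2 d_2}$$

If estimates of population variance $s_1^2$ and $s_2^2$ are available, variance
of the population totals (ignoring the fpc) may be written

$$s_{T_1}^2 = N_1^2\left(\frac{s_1^2}{n_1}\right) \quad \text{and} \quad s_{T_2}^2 = N_2^2\left(\frac{s_2^2}{n_2}\right)$$

where: $N_1$ and $N_2$ = Number of units of each size in the population.
Now if the two methods are to give equal precision for the
estimate of total production,

$$s_{T_1}^2 = s_{T_2}^2$$

or,

$$N_1^2\left(\frac{s_1^2}{n_1}\right) = N_2^2\left(\frac{s_2^2}{n_2}\right)$$

and, solving for $n_1$,

$$n_1 = n_2\,\frac{N_1^2 s_1^2}{N_2^2 s_2^2}$$

This last quantity may be simplified by remembering that the total
number of 6-inch units ($N_1$) is twice the total number of 12-inch
units ($N_2$); hence,

$$n_1 = n_2\,\frac{(2N_2)^2 s_1^2}{N_2^2 s_2^2} = \frac{4\,n_2\,s_1^2}{s_2^2}$$

If we substitute this value of $n_1$ in the relative cost formula given
above

$$\frac{c_1}{c_2} = \frac{4\,s_1^2\,d_1}{s_2^2\,d_2}$$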



In this example, a special study showed $s_1^2$ and $s_2^2$ to be 134.1 and
416.0 respectively; and the average times for locating the frame
and making the count for each size of frame were found to be
$d_1$ = 94.36 and $d_2$ = 129.00. Substituting these values in the
equation for relative cost,

$$\frac{c_1}{c_2} = \frac{4(134.1)(94.36)}{(416.0)(129.00)} = 0.943$$

This result indicates that the 6-inch frame is slightly more efficient
than the 12-inch frame.

In more general terms the cost of method 1 relative to the cost
of method 2 for a specified sampling error would be

$$\frac{c_1}{c_2} = \frac{N_1^2\,s_1^2\,d_1}{N_2^2\,s_2^2\,d_2}$$
The same result is obtained by thinking in terms of the relative 
efficiency of the alternative procedures. As a measure of efficiency, 
statisticians commonly use the reciprocal of the product of the cost 
per unit and the squared coefficient of variation for the given 
sample unit. If the coefficient of variation is symbolized by C and 
the cost by d, the efficiency (U) is given by 





$$U = \frac{1}{(d)(C)^2}$$

The relative efficiency of two alternatives would then be

$$\frac{U_1}{U_2} = \frac{(d_2)(C_2)^2}{(d_1)(C_1)^2} \quad \text{or} \quad \frac{U_2}{U_1} = \frac{(d_1)(C_1)^2}{(d_2)(C_2)^2}$$

In the previous example we had

$d_1$ = 94.36      $s_1^2$ = 134.1
$d_2$ = 129.00     $s_2^2$ = 416.0

For the 6-inch frame the squared coefficient of variation is

$$(C_1)^2 = \frac{s_1^2}{\bar{y}_1^2}$$

For the 12-inch frame the squared coefficient of variation would
be

$$(C_2)^2 = \frac{s_2^2}{\bar{y}_2^2}$$

The mean per unit for the 12-inch frame ($\bar{y}_2$) should be twice the
mean per unit for the 6-inch frame, so that we can write

$$(C_2)^2 = \frac{s_2^2}{(2\bar{y}_1)^2} = \frac{s_2^2}{4\bar{y}_1^2}$$

Then the efficiency of the 12-inch frame relative to that of the 6-inch
frame is

$$\frac{U_2}{U_1} = \frac{94.36\,(134.1/\bar{y}_1^2)}{129.00\,(416.0/4\bar{y}_1^2)} = \frac{4(94.36)(134.1)}{(129.00)(416.0)} = 0.943$$










As before, the 6-inch frame appears more efficient than the 12-inch
frame.
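
The two routes to the frame comparison (relative cost at equal precision, and relative efficiency based on the coefficient of variation) can be checked against each other with a short Python sketch. The function names are hypothetical, and the mean count per 6-inch frame used below is an arbitrary placeholder: as the derivation shows, it cancels out of the efficiency ratio.

```python
def relative_cost(s1_sq, d1, s2_sq, d2, units_ratio):
    """Cost of method 1 relative to method 2 at equal precision.

    units_ratio is N1 / N2, the ratio of the numbers of units in the population.
    """
    return units_ratio ** 2 * s1_sq * d1 / (s2_sq * d2)

def efficiency(cost_per_unit, variance, mean_per_unit):
    """U = 1 / (d * C^2), with C the coefficient of variation."""
    cv_squared = variance / mean_per_unit ** 2
    return 1.0 / (cost_per_unit * cv_squared)

s1_sq, d1 = 134.1, 94.36    # 6-inch frame: variance, time per observation
s2_sq, d2 = 416.0, 129.00   # 12-inch frame

cost_ratio = relative_cost(s1_sq, d1, s2_sq, d2, units_ratio=2)

mean_6_inch = 10.0          # placeholder value; any positive number works
u1 = efficiency(d1, s1_sq, mean_6_inch)
u2 = efficiency(d2, s2_sq, 2 * mean_6_inch)

print(f"c1/c2 = {cost_ratio:.3f}")
print(f"U2/U1 = {u2 / u1:.3f}")
```

Both ratios come out the same, and a value below 1 favors the 6-inch frame.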


Estimating Changes 


Changes that have taken place in the characteristics of a forest 
population are often of as much interest as their present status. 
Periodic change in stand volume is, for example, a major concern 
of foresters. 

Estimating such changes usually requires sampling at the be- 
ginning and end of the period. The difference or some function of 
the difference between the two estimates is the estimated change. 
Ordinarily the same sampling method will be used each time, but 
that is not absolutely necessary. 

Temporary or permanent plots.—Estimating change by samp- 
ling at two different times always raises the question of temporary 
or permanent sample plots. That is, should an entirely new set of 
units be randomly selected for observation at each time, or should 
the same units be observed at both times? A third alternative is 
to have some temporary and some permanent plots in a double 
sampling system: a large sample of temporary plots with a sub- 
sample of permanent plots. 

The choice between temporary and permanent plots depends 
heavily on the degree of correlation that can be expected between 
the initial and final plot values. If a high positive correlation is 
expected, permanent plots should give the better precision. If the 
correlation is likely to be low or negative, temporary plots might 
be better. If the period is relatively short and if cutting or heavy 
mortality is unlikely, the correlation probably will be large and 
positive, favoring the use of permanent plots. Where large volume 
changes are likely to occur because of cutting, heavy mortality, 
or a very long time interval, the correlation will be small or even 
negative, favoring the use of temporary plots. 

If there is enough information on cost and variability, the advantage
of permanent plots with simple random sampling can be
weighed by computing the relative cost ($R_c$) of obtaining a given
precision by the two methods.


$$R_c = \frac{C_t\,(s_1^2 + s_2^2)}{C_p\,(s_1^2 + s_2^2 - 2 s_{12})}$$

where: $C_t$ = Cost of locating and making a single measurement
             on a temporary plot.

       $C_p$ = Total cost of locating, measuring, monumenting,
             relocating, and remeasuring a permanent plot.

       $s_1^2$ = Variance among individual plots at the time of the
             first measurement.

       $s_2^2$ = Variance among individual plots at the time of the
             second measurement.

       $s_{12}$ = Covariance between the first and second measurements
             on individual plots.

If $R_c$ is greater than 1, permanent plots should be used. If $R_c$ is
less than 1, temporary plots will probably be better. Where remeasurements
will be made several times, the average cost per
permanent plot will be reduced, swinging the ratio more favorably
towards permanent plots.
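
A small Python sketch shows how the relative cost might be evaluated. The eight paired measurements are those of the handbook's permanent-plot example; the two cost figures are purely hypothetical values chosen for illustration (a permanent plot is assumed to cost three times as much as one temporary-plot measurement), and the function name is invented.

```python
import statistics

def relative_cost(cost_temp, cost_perm, s1_sq, s2_sq, s12):
    """R_c = C_t (s1^2 + s2^2) / [C_p (s1^2 + s2^2 - 2 s12)]."""
    return cost_temp * (s1_sq + s2_sq) / (cost_perm * (s1_sq + s2_sq - 2 * s12))

# First and second measurements on the same 8 permanent plots.
first = [24, 14, 16, 27, 10, 30, 12, 21]
second = [26, 18, 22, 27, 14, 33, 16, 24]

n = len(first)
mean_1 = statistics.mean(first)
mean_2 = statistics.mean(second)
s1_sq = statistics.variance(first)
s2_sq = statistics.variance(second)
s12 = sum((a - mean_1) * (b - mean_2) for a, b in zip(first, second)) / (n - 1)

# Hypothetical costs, in arbitrary units.
cost_temp, cost_perm = 1.0, 3.0
r_c = relative_cost(cost_temp, cost_perm, s1_sq, s2_sq, s12)

print(f"s1^2 = {s1_sq:.4f}, s2^2 = {s2_sq:.4f}, s12 = {s12:.4f}")
print(f"R_c = {r_c:.2f}")
```

With a covariance this large the ratio is well above 1 even though the permanent plot was assumed to be three times as costly, which illustrates why a high positive correlation favors permanent plots.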

Plot monumentation.—The question of kind and degree of plot 
monumentation has been hotly debated among the users of per- 
manent plots. Where any form of stand treatment is likely to 
take place between measurements, it is generally conceded that the 
plot location and form of monumentation should not be discernible 
to those who make the stand treatments. It is very difficult, if not 
humanly impossible, to avoid treating plot areas differently from 
nonplot areas. At the same time, if the monuments are too cleverly 
concealed, relocation costs will be increased and some plots may 
not be found at all. Because the difficulty of plot relocation is 
likely to be related to stand conditions that are in turn related 
to growth, failure to relocate plots could slightly bias the estimates. 

Sampling errors.—If the mean per unit at the time of the first
measurement is $\bar{y}_1$ and the mean per unit at the time of the second
measurement is $\bar{y}_2$, the estimated periodic change per unit is
$(\bar{y}_2 - \bar{y}_1)$.

With temporary plots, the standard error of the estimated
change would be

$$s_{(\bar{y}_2 - \bar{y}_1)} = \sqrt{s_{\bar{y}_1}^2 + s_{\bar{y}_2}^2}$$

where $s_{\bar{y}_1}^2$ and $s_{\bar{y}_2}^2$ are the squared standard errors of the mean
at the time of the first and second measurements. The method of
computing $s_{\bar{y}_1}^2$ and $s_{\bar{y}_2}^2$ would be that appropriate to the particular
sampling method used.

With permanent plots, the easiest procedure for computing the
standard error is to work with the individual differences. Thus, if
$y_{1i}$ stands for the first measurement of the $i$th permanent plot and
$y_{2i}$ stands for the second measurement on that plot, then $d_i =
(y_{2i} - y_{1i})$. The standard error of the mean difference is computed
from the $d_i$ values with the formula appropriate for the particular
sampling method.


Examples.—The above computations will be illustrated for a 
simple random sample. 


Temporary Plots

Initial observations: n = 8

$y_{1i}$ = 12, 24, 27, 14, 16, 10, 21, 30

$\sum y_{1i}$ = 154          $\bar{y}_1$ = 19.25

$s_1^2$ = 53.9286          $s_{\bar{y}_1}^2 = \dfrac{53.9286}{8} = 6.7411$

Final observations: n = 8

$y_{2i}$ = 27, 18, 22, 33, 14, 26, 16, 24

$\sum y_{2i}$ = 180          $\bar{y}_2$ = 22.50

$s_2^2$ = 40.0000          $s_{\bar{y}_2}^2 = \dfrac{40.0000}{8} = 5.0000$

Then the estimated mean difference is

$$(\bar{y}_2 - \bar{y}_1) = (22.50 - 19.25) = 3.25$$

The standard error of the mean difference is

$$s_{(\bar{y}_2 - \bar{y}_1)} = \sqrt{6.7411 + 5.0000} = 3.43$$


Permanent Plots

Permanent plot No.               1    2    3    4    5    6    7    8    Sum    Mean
Initial observations (y_1i)     24   14   16   27   10   30   12   21    154   19.25
Final observations (y_2i)       26   18   22   27   14   33   16   24    180   22.50
Differences (d_i = y_2i - y_1i)  2    4    6    0    4    3    4    3     26    3.25

The estimated mean difference is

$$(\bar{y}_2 - \bar{y}_1) = \bar{d} = 3.25$$

The standard error of the mean difference is calculated from the
$d_i$ values with the formula for a simple random sample.

$$s_d^2 = \frac{(2^2 + 4^2 + \ldots + 3^2) - \dfrac{(26)^2}{8}}{7} = 3.0714$$

$$s_{\bar{d}} = \sqrt{\frac{3.0714}{8}} = 0.62$$

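
The temporary-plot and permanent-plot computations for the standard error of an estimated change can be compared side by side in Python. This is a sketch for checking the worked examples under simple random sampling with the finite-population correction ignored; the helper name `squared_se_of_mean` is made up.

```python
import math
import statistics

def squared_se_of_mean(values):
    """Squared standard error of the mean for a simple random sample (no fpc)."""
    return statistics.variance(values) / len(values)

# Temporary plots: two independent samples of 8 plots each.
temp_initial = [12, 24, 27, 14, 16, 10, 21, 30]
temp_final = [27, 18, 22, 33, 14, 26, 16, 24]

change_temp = statistics.mean(temp_final) - statistics.mean(temp_initial)
se_temp = math.sqrt(squared_se_of_mean(temp_initial) + squared_se_of_mean(temp_final))

# Permanent plots: the same 8 plots measured twice, so work with differences.
perm_initial = [24, 14, 16, 27, 10, 30, 12, 21]
perm_final = [26, 18, 22, 27, 14, 33, 16, 24]
differences = [y2 - y1 for y1, y2 in zip(perm_initial, perm_final)]

change_perm = statistics.mean(differences)
se_perm = math.sqrt(squared_se_of_mean(differences))

print(f"temporary plots: change = {change_temp:.2f}, s.e. = {se_temp:.2f}")
print(f"permanent plots: change = {change_perm:.2f}, s.e. = {se_perm:.2f}")
```

The estimated change is the same in both cases, but pairing the measurements removes most of the plot-to-plot variation from the standard error.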
Design of Sample Surveys 


It has been the purpose of this handbook to treat only one segment
of the design of sample surveys, that of the sampling method
and associated computational procedures. These are the aspects
of sampling that seem to be most troublesome to foresters. But
several other phases of survey design also deserve attention. Some
of the points that should be considered in planning a survey are
summarized here.


The objective must be stated.—Specifically, identify the parameter
to be estimated and the precision desired. An example of a
lucid objective might be: “To estimate the number of plantable
slash pine seedlings at the Riedsville Nursery. The estimate should
be within 1 percent of the true number, with 95-percent confidence.”
Vague statements (“To study the results of spraying...”
“To estimate the effectiveness of...”) can and do result in an
appalling waste of survey efforts.

The population should be defined.—What are the units consti- 
tuting the population? What are the unit values? What units are 
excluded from the population? Careful, accurate answers to these 
questions will forestall numerous difficulties at later stages. A
generality worth repeating is that sampling design will be simplified
if the specifications for the units used to define the population
are identical with those used in the sample. Even at that, the
definition and specification may be difficult. It may be easy to define
a tree or a plot, but if a survey is to be made of farmers,
pulpwood contractors, or seed orchards, the unit may be very hard
to define. An attempt should be made to foresee the difficulties
that might arise in classifying a unit as in or out of the popula- 
tion; the borderline instances will be a constant source of trouble 
to enumerators and analysts. 

The data to be collected should be specified.—Special attention
must be paid to getting all the data necessary to the objective.
It is a moot question how far one should go in taking supplementary
data that are not pertinent to the main objective. Frequently
cooperators and reviewers, sensing an opportunity to obtain
information on some pet project, will request that additional observations
be made “while you're there.” Such requests must be
carefully reviewed. “Free” information is not cheap if it is never 
used or has an adverse effect on the main objective of the survey. 

Measurement techniques must be prescribed.—The measurement 
procedures should be stated unambiguously. The detail needed will 
vary with the complexity of the measurements and the experience 
of the personnel, but in general it is better to be annoyingly spe- 
cific than trustingly vague. Terms such as merchantable top, over- 
story, undesirable, stocked, board-foot volume, and plantable 
should be precisely defined. 

The need for training and preliminary practice should be considered.
And proficiency tests are not unwarranted, even for the
old hands who may have forgotten some of their earlier training 
or developed bad habits. 

The sampling units must be defined.—Again, the totality of
sampling units, however distributed, must comprise the population.
If the unit is obvious, e.g., a sawmill, no particular trouble
need arise. But if a variety of units are possible, a search of literature
will frequently uncover some profitable experience; if not,
a study of the optimum size and shape of sampling unit may be
required.

The sampling method must be described.—This handbook outlines
a number of methods that have been found useful in forestry.
Thought, experience, and a review of literature will help in deciding
which method is most appropriate for a particular situation.
The method of selecting the sample units should be carefully 
stated, and so should the procedure of locating the units in the 
field. Saying that a two-stage design will be used with primaries 
and secondaries selected at random is not enough. How will ran- 
domization be accomplished? And how will the unit be located in 
the field? The possibilities of and antidotes for bias in locating 
units deserve some thought. Timber cruisers will, for example, 
tend to veer away from dense brush and openings when locating 
plots by hand compass and pacing. House-to-house interviewers 
have been known to neglect top-floor apartments and homes with 
barking dogs. 

At this stage it is also well to think out the procedures to be 
used for estimating the parameters and sampling errors. Collect- 
ing data and then asking someone how to use it is a good way to 
lose friends and waste survey money. 

The sample size must be prescribed.—Once the desired precision,
choice of sampling unit, and method of sampling have been
stated, it is time to think of the size of sample. The sample should
be just large enough to give the specified precision, and no larger. 
If the requisite information on costs and variances is available, 
this decision should be made prior to the start of field work. In 
the absence of such information, a preliminary survey may be 
necessary. 

Possible problems of data should be considered.—If the preceding
steps are meticulously followed, problems arising at the data-
collection stage are usually those of organization and personnel. 
The greatest single stumbling block is the common failure of 
supervisors to continue training and checking field crews or to 
provide for editing of field forms. Some organizations find it 
worthwhile to make punched-card sorts to check for recording 
mistakes such as trees that are 3 inches in d.b.h. and have 14 logs 
(instead of a 14-inch tree with 3 logs). 

Data processing should be planned.—In most cases, procedures 
for computation and analysis are fixed by the choice of sampling 
methods. In organizing the computing, there may be some extra- 
ordinary considerations that merit early attention. If the volume 
of data is small, computing may be readily absorbed in the daily 
routine. If the volume is large, special staffing and special equip- 
ment may be desirable. If, for example, the analysis is to be on
electronic computers, it would be advisable to become familiar
with the special requirements of electronic computing,
such as data format for keypunching, availability of programs,
and cost of programming.







REFERENCES FOR ADDITIONAL READING 


Cochran, W. G.
    1953. Sampling techniques. 330 pp., illus. Wiley, New York.
Deming, W. E.
    1950. Some theory of sampling. 602 pp., illus. Wiley, New York.
Dixon, W. J., and Massey, F. J., Jr.
    1957. Introduction to statistical analysis. Ed. 2, 488 pp., illus. McGraw-
    Hill, New York.
Hansen, M. H., Hurwitz, W. N., and Madow, W. G.
    1953. Sample survey methods and theory. Vol. I, 638 pp., illus. Wiley,
    New York.
Hendricks, W. A.
    1956. The mathematical theory of sampling. 364 pp., illus. Scarecrow
    Press, New Brunswick, N. J.
Schumacher, F. X., and Chapman, R. A.
    1942. Sampling methods in forestry and range management. Duke Univ.
    School Forestry Bul. 7, 213 pp., illus.
Snedecor, G. W.
    1956. Statistical methods. Ed. 5, 534 pp., illus. Iowa State Univ. Press,
    Ames, Ia.
Sukhatme, P. V.
    1954. Sampling theory of surveys, with applications. 491 pp., illus. Iowa
    State Univ. Press, Ames, Ia.
Yates, Frank.
    1960. Sampling methods for censuses and surveys. Ed. 2, 440 pp., illus.
    Hafner, New York.













PRACTICE PROBLEMS IN SUBSCRIPT AND 
SUMMATION NOTATION 


Values of the Variable zy 
j Classification (7=1,..., 10) 








i Class. 

ification 

1 2 8 4 5 6 7 8 9 10 ification 
1| 6 4 2 0 4 3 629 6 8| 47 
Ве |2! 4 8 4 9 1 4 18 2 1| s 
гз ава ва ва а 1 з 38 
Я 4|10оз? о 0 0 2 а 8| 2 
BT 156 озвтъвзвБбад 40 
Ол 6 | з 75352 4 3 2 6| 4 
i 7/2172 64141 6 4 3| в 





j Classification
subtotals 18 25 29 24 21 23 16 32 23 32 243





Examples: 
аа сод ааа со 6 азт=2 Фал = 0 
a 10 
5 5а — as ot tiio + Фад ааа 
miei кр) 
‚бы: 8+4+8+...-3) 
cid +8 48+... 
2 Да = (221 + Zaa + Zoa ава + 282 + 233) 
= (448444243842) = 
à 5 а = (ааа раце + гад? + годе) 
= (2 4 02 +4? 4 2) =24 
3/4 үг 
> Ба) = (таз + 224)? + (ава + 254)? 


= (44 2)2 + (248)2-- 186 




(È Bay = (ns за + os + tao)? 
= (5448 4-2): = 196 

10 E 

E ty = (ааа + ава +... ao) 


= (2+3+...+2) =33 


Бай = (824124682 12) = 148 
2 
(ва) = 29: = 841 
T 
D ty = 243 
$3 
> 2078 = (а г (n. з) & (22,2) (223) 
ыш b (Шш) a) 
- Ж, + (8) (4) +... + (1) (7) = 100 
Живи — га) = (ава — аал) + (252 — 24) 
1 Heee H sio — ало 
“(вен 
= (40 — 20) = 20 
E (ху — ty)? = (0— 1)?+ (2—0)?+ (6 — 3)? 
7 .“ (4 — 8) 
= 188 
Erfa — B+... + 42) — (12+ 07 
7 7 + + 84) = 122 


(ее) (ен) m 
== 1,200 


[5 (ты — Фа) ] = [= E — > «| 
= [40 — 20]? 400 











= 8 (221) + 30222) +... 32239) 
= B (221 + 242 +... + a10) 
= 3 (Day) = 880) =90 


= (2—6) + (a — 6 
“рова итар 

= (аа + tea ан. + ало) 
Siri B 

= (ж, — 10(6) 

= (20 - 60) — —40 




















































TABLES 
TABLE 1.—Ten thousand randomly assorted digits

[Table body not recoverable from the scan.]

This table is reproduced, by permission of the author and publishers, from table 1.5.1 of Snedecor's Statistical Methods (5th ed.), Iowa State University Press.







TABLE 2.—The distribution of t 
Probability 












































This table is abridged from table III of Fisher and Yates' Statistical Tables
for Biological, Agricultural, and Medical Research, Oliver and Boyd Ltd.,
Edinburgh. Permission has been given by the authors and publishers.




















































































TABLE 3.—Confidence intervals for binomial distribution

[Table body not recoverable from the scan: 95-percent and 99-percent intervals, by number or fraction observed and size of sample, n.]

This table is reproduced, by permission of the author and publishers, from table 1.3.1 of Snedecor's Statistical Methods (5th ed.), Iowa State University Press.








TABLE 4.—Arcsin transformation (angles corresponding to percentages,
angle = arcsin $\sqrt{\text{percentage}}$)

[Table body not recoverable from the scan.]

This table is reproduced, by permission of the author and publishers, from
table 11.12.1 of Snedecor's Statistical Methods (ed. 5), Iowa State University
Press. Permission has also been granted by the original author, Dr. C. I. Bliss,
of the Connecticut Agricultural Experiment Station.
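
The transformation named in the heading of table 4 can be computed directly when the printed table is not at hand. The following is only a sketch: it assumes the percentage is first converted to a proportion (divided by 100) and that the angle is wanted in degrees, and the function name `arcsin_angle` is illustrative, not part of the handbook.

```python
import math

def arcsin_angle(percentage):
    """Angle (degrees) whose sine is the square root of the proportion.

    Assumes `percentage` lies between 0 and 100.
    """
    proportion = percentage / 100.0
    return math.degrees(math.asin(math.sqrt(proportion)))

# A few illustrative percentages and their transformed angles.
for pct in (0.09, 25, 50, 100):
    print(pct, round(arcsin_angle(pct), 2))
```

Under these assumptions 50 percent maps to 45 degrees and 100 percent to 90 degrees.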


U.S. GOVERNMENT PRINTING OFFICE: 1968 O-792-818





ELEMENTARY 
STATISTICAL METHODS 
FOR FORESTERS 


Frank Freese 
Statistician 
Forest Products Laboratory 


(Maintained by the Forest Service at Madison, Wis., in cooperation
with the University of Wisconsin)


AGRICULTURE HANDBOOK 317 


U.S. Department of Agriculture 
Forest Service 


January 1967 





Corvallis, Oregon 
1981 
Litho-U.S.A.





ACKNOWLEDGMENTS 


Professor George W. Snedecor and the Iowa State University
Press have generously allowed me to republish material
from Statistical Methods, fifth edition, in tables 1,
3-7 of this handbook. The editor and trustees of Biometrika
concurred in the reprinting of table 4. I wish also
to thank Dr. C. I. Bliss, of the Connecticut Agricultural
Experiment Station, who originally prepared the data in
table 6.

I am indebted to the literary executor of the late Sir
Ronald A. Fisher, F.R.S., Cambridge, to Dr. Frank Yates,
F.R.S., Rothamsted, and to Oliver and Boyd, Ltd., Edinburgh,
for their permission to reprint present table 2 from
Statistical Tables for Biological, Agricultural and Medical
Research; and portions of present tables 3 and 7 from
Statistical Methods for Research Workers.

Thanks are also due to those who reviewed the manu- 
script and contributed to it through their suggestions, 
particularly Thomas Evans, Virginia Polytechnic Insti- 
tute; Kenneth Ware, Iowa State University; and Donald 
Kulow, West Virginia University. 


Frank Freese 
Forest Products Laboratory 




Basic concepts
    Why sample?
    Populations, parameters, and estimates
    Bias, accuracy, and precision
    Variables, continuous and discrete
    Distribution functions
Tools of the trade
    Subscripts, summations, and brackets
    Variance
    Standard errors and confidence limits
    Expanded variances and standard errors
    Coefficient of variation
    Covariance
    Correlation coefficient
    Independence
    Variances of products, ratios, and sums
    Transformations of variables
Sampling methods for continuous variables
    Simple random sampling
    Stratified random sampling
    Regression estimators
    Double sampling
    Sampling when units are unequal in size (including pps sampling)
    Two-stage sampling
    Two-stage sampling with unequal ...
    Systematic sampling
Sampling methods for discrete variables
    Simple random sampling—classification data
    Cluster sampling for attributes
    Cluster sampling for attributes—unequal ...
    Sampling of count variables
Some other aspects of sampling
    Size and shape of sampling units
References for additional reading
Practice problems in subscript and summation notation
Tables
    1. Ten thousand randomly assorted digits
    2. The distribution of t
    3. Confidence intervals for binomial ...
    4. Arcsin transformation


PREFACE 


This handbook was written under the assumption that forest research 
workers want and should have a working knowledge of the simpler sta- 
tistical methods, and that most of them lack the time to extract this 
information from the comprehensive texts. It defines some basic terms 
and shows the computational routine for the statistical methods that have 
been found most useful in forestry. The meaning of various statistical 
quantities is discussed to a very limited degree. The general approach is 
based on the observation that most researchers have difficulty in learning 
the meanings and derivations of statistics (and have some reluctance to 
do so) until they have mastered the computational details. 

The purpose, then, is to give the reader a handy reference for useful 
basic techniques and also to convince him that statistical methods can be 
learned. Having absorbed this minimal dose without great pain, he may 
be inclined to make a more thorough study of the subject as presented in
the standard textbooks. 

This handbook is an extensive revision and expansion of Guidebook for 
Statistical Transients, an informal release by the same author first issued
in 1956 by the Southern Forest Experiment Station and reissued in 1963.
The revision was completed after the author's assignment to the Forest
Products Laboratory.

A complementary publication, Agriculture Handbook 232, Elementary 
Forest Sampling, by the same author, covers sampling methods and pro- 
cedures in detail. 


Coefficient of variation
Standard error of the mean
Sampling—measurement variables
Simple random sampling
Comparison of two or more groups by analysis of variance
Complete randomization
Multiple comparisons
F test with single degree of freedom
Scheffé's test
Unequal replication
Randomized block design
Latin square
Factorial experiments
The split plot design
Missing plots
Regression
Confidence intervals
Multiple regression
Tests of significance
Coefficient of multiple ...
The c-multipliers
Curvilinear regressions and interactions
... regressions
Analysis of covariance in a randomized block design
References for further reading

APPENDIX
Tables
    Accumulative distribution of ...
    Confidence intervals for binomial ...
    Arc sine transformation
    Significance of correlation coefficients


GENERAL CONCEPTS 


Statistics—What For? 


To the uninitiated it may often appear that the statistician’s primary 
function is to prevent or at least impede the progress of research. And 
even those who suspect that statistical methods may be more boon than
bane are at times frustrated in their efforts to make use of the statistician's
wares.

Much of the difficulty is due to not understanding the basic objectives 
of statistical methods. We can boil these objectives down to two: 

1. The estimation of population parameters (values that characterize a
particular population).

2. The testing of hypotheses about these parameters. 

A common example of the first is the estimation of the coefficients a and
b in the linear relationship, Y=a+bX, between the variables Y and X.
To accomplish this objective one must first define the population involved
and specify the parameters to be estimated. This is primarily the research
worker's job. The statistician helps devise efficient methods of
collecting the data and calculating the desired estimates.

Unless the whole population is examined, an estimate of a parameter is 
likely to differ to some degree from the population value. The unique 
contribution of statistics to research is that it provides ways of evaluating 
how far off the estimate may be. This is ordinarily done by computing
confidence limits, which have a known probability of including the true
value of the parameter. Thus, the mean diameter of the trees in a pine
plantation may be estimated from a sample as 9.2 inches, with 95-percent
confidence limits of 8.8 and 9.6 inches. These limits (if properly obtained)
tell us that, unless a one-in-twenty chance has occurred in sampling, the
true mean diameter is somewhere between 8.8 and 9.6 inches.

The second basic objective in statistics is to test some hypothesis about
the population parameters. A common example is a test of the hypothesis
that the regression coefficient b in the linear model


Y=a+bX 


has some specified value (say zero). Another example is a test of the 
hypothesis that the difference between the means of two populations is 
zero. 

Again, it is the research worker who should formulate meaningful hy- 
potheses to be tested, not the statistician. This task can be tricky. The 
beginner would do well to work with the statistician to be sure that the 
hypothesis is put in a form that can be tested. Once the hypothesis is set, 
it is up to the statistician to work out ways of testing it and to devise 
efficient procedures for obtaining the data. 

This handbook describes some of the methods of estimating certain 
parameters and testing some of the more common hypotheses. 





Probability and Statistics 


It is fairly well known that statisticians work with probabilities. They
are supposed to know, for example, the likelihood of tossing coins heads up
six times in a row, or the chances of a crapshooter making seven consecutive
winning throws ("passes"), and many other such useful bits of information.
(This is assumed to give them an edge in games of chance, but
often other factors enter in there.)










probability of making a single pass is really 0.493, then the probability of
seven or more consecutive passes is about 0.007 (or 1 in 141). This is
where statistics ends; you draw your own conclusions about the shooter.
If you conclude that the shooter is pulling a fast one, then in statistical
terms you are rejecting the hypothesis that the probability of the shooter
making a single pass is 0.493.
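
The 1-in-141 figure can be checked with a line or two of arithmetic. This sketch assumes that successive passes are independent and that the single-pass probability stays at 0.493; the function name is illustrative only.

```python
def prob_consecutive(p_single, k):
    """Probability of k successes in a row, each with probability p_single."""
    return p_single ** k

p7 = prob_consecutive(0.493, 7)
print(round(p7, 3))    # 0.007
print(round(1 / p7))   # 141
```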

Most statistical tests are of this nature. A hypothesis is formulated
and an experiment is conducted or a sample is selected to test it. The
next step is to compute the probability of the experimental or sample
results occurring by chance if the hypothesis is true. If this probability
is less than some preselected value (perhaps 0.05 or 0.01), the hypothesis
is rejected. Note that nothing has been proved; we haven't even proved
that the hypothesis is false. We merely inferred this because of the low
probability associated with the experiment or sample results.

Obviously our inferences may be wrong if we are given inaccurate
probabilities. Reliable computation of these probabilities requires a
knowledge of how the variable we are dealing with is distributed (that is,
what the probability is of the chance occurrence of different values of the
variable). Thus, if we know that the number of beetles caught in light
traps follows what is called the Poisson distribution, we can compute the
probability of catching X or more beetles. But, if we assume that this
variable follows the Poisson when it actually follows the negative binomial
distribution, our computed probabilities may be in error.
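
As an illustration of the kind of computation meant here, the sketch below gives the probability of catching X or more beetles when the catch really is Poisson. The mean of 2 beetles per trap is a made-up value for the example, and the function name is hypothetical.

```python
import math

def poisson_tail(x, mean):
    """P(catch >= x) for a Poisson-distributed catch with the given mean."""
    # Sum the probabilities of 0, 1, ..., x-1 and subtract from 1.
    below = sum(math.exp(-mean) * mean ** k / math.factorial(k)
                for k in range(x))
    return 1.0 - below

# Hypothetical trap averaging 2 beetles: chance of 5 or more in one catch.
print(round(poisson_tail(5, 2.0), 4))
```

If the catches actually followed some other distribution, the same call would return a misleading probability, which is the point of the paragraph above.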

Even with reliable probabilities, statistical tests can lead to the wrong 
conclusions. We will sometimes reject a hypothesis that is true. If we 
always test at the 0.05 level, we will make this mistake on the average of 
1 time in 20. We accept this degree of risk when we select the 0.05 level 
of testing. If we're willing to take a bigger risk, we can test at the 0.10
or the 0.25 level. If we're not willing to take this much risk, we can test
at the 0.01 or 0.001 level.

The fellow who always wears both a belt and suspenders might, at this
point, conclude that he should always test at the 0.00001 level. Then
he'd be wrong only 1 time in 100,000. But a researcher can make more
than one kind of error. In addition to rejecting a hypothesis that is true
(called a Type I error), he can make the mistake of not rejecting a hypothesis
that is false (called a Type II error). In crap shooting, it is a mistake
to accuse an honest shooter of cheating (Type I error: rejecting a true
hypothesis), but it is also a mistake to trust a dishonest shooter (Type II
error: failure to reject a false hypothesis).

The difficulty is that for a given set of data, reducing the risk of one kind
of error increases the risk of the other kind. If we set 15 straight passes
as the critical limit for a crap shooter, then we greatly reduce the risk of
making a false accusation (probability about 0.000025). But in so doing
we have dangerously increased the probability of making a Type II error:
failure to detect a phony. A critical step in designing experiments is
the attainment of an acceptable level of probability for each type of error.
This is usually accomplished by specifying the level of testing (i.e., probability
of an error of the first kind) and then making the experiment large
enough to attain an acceptable level of probability for errors of the second
kind.
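
The trade-off can be made concrete with a rough sketch. It assumes independent throws, an honest single-pass probability of 0.493, and a purely hypothetical dishonest shooter who passes 80 percent of the time; "undetected" here simply means the shooter falls short of the critical number of straight passes. None of these names or values come from the handbook beyond the 0.493.

```python
def type_one_risk(limit, p_honest=0.493):
    """Chance an honest shooter makes `limit` straight passes and is accused."""
    return p_honest ** limit

def type_two_risk(limit, p_cheat):
    """Chance a dishonest shooter falls short of `limit` straight passes."""
    return 1.0 - p_cheat ** limit

# Raising the critical limit lowers one risk and raises the other.
for limit in (7, 15):
    print(limit, type_one_risk(limit), type_two_risk(limit, p_cheat=0.8))
```

Moving the limit from 7 to 15 passes shrinks the first risk sharply while pushing the second close to certainty under the assumed cheating rate.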

It is beyond the scope of this handbook to go into basic probability
computations, distribution theory, or the calculation of Type II errors.
But anyone who uses statistical methods should be fully aware that he is
dealing primarily with probabilities and not with immutable absolutes.
The results of a t, F, or chi-square test must be interpreted with this in
mind. It is also well to remember that one-in-twenty chances do actually
occur, about one time out of twenty.


SOME BASIC TERMS AND CALCULATIONS 
The Mean 


One of the most familiar and commonly estimated population parameters
is the mean. Given a simple random sample, the population mean
is estimated by

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

where: $X_i$ = The observed value of the $i$th unit in the sample.
       $n$ = The number of units in the sample.
       $\sum_{i=1}^{n} X_i$ means to sum up all $n$ of the X-values in the sample.

If there are N units in the population, the total of the X-values over all
units in the population would be estimated by

$$\hat{T} = N\bar{X}$$

The circumflex (^) over the T is frequently used to indicate an estimated
value as opposed to the true but unknown population value.

It should be noted that this estimate of the mean is used for a simple
random sample. It may not be appropriate if the units included in the
sample are not selected entirely at random.
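
A minimal sketch of the two estimates, the sample mean and the expanded population total. The ten diameters are the loblolly pine sample used in the next section; the population size of 1,000 is only an illustrative value, and the function names are not from the handbook.

```python
def sample_mean(values):
    """Mean of a simple random sample: the sum of the values divided by n."""
    return sum(values) / len(values)

def estimated_total(values, population_size):
    """Estimated population total: N times the sample mean."""
    return population_size * sample_mean(values)

diameters = [9, 9, 11, 9, 7, 7, 10, 8, 9, 11]   # inches
print(sample_mean(diameters))                    # 9.0
print(estimated_total(diameters, 1000))          # 9000.0
```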

Methods of computing confidence limits for the mean are discussed in 
the section on sampling (see p. 11). 





Standard Deviation 


Another commonly estimated population parameter is the standard
deviation. The standard deviation characterizes dispersion of individuals
about the mean. It gives us some idea whether most of the individuals
in a population are close to the mean or spread out. The standard deviation
of individuals in a population is frequently symbolized by $\sigma$ (sigma).
On the average, about two-thirds of the unit values of a normal population
will be within 1 standard deviation of the mean. About 95 percent will
be within 2 standard deviations and about 99 percent within 2.6 standard
deviations.

We will seldom know or be able to determine $\sigma$ exactly. However,
given a sample of individual values from the population we can often
make an estimate of $\sigma$, which is commonly symbolized by s. For a simple
random sample of n units, the estimate is

$$s = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n-1}}$$

where $\sum X^2$ = the sum of squared values of all individual measurements.
      $(\sum X)^2$ = the square of the sum of all measurements.

This is equivalent to the formula

$$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n-1}}$$

where $\bar{X}$ = the arithmetic mean = $\frac{\sum X}{n}$
      $(X - \bar{X})$ = the deviation of an individual measurement from the
      mean of all measurements.


Here is an example: Ten individual trees in a loblolly pine plantation
were selected at random and measured. Their diameters were 9, 9, 11,
9, 7, 7, 10, 8, 9, and 11 inches. Based on this sample, what is the arithmetic
mean diameter and the standard deviation? Tabulating the
measurements and squaring each of them:

              X       X²
              9       81
              9       81
             11      121
              9       81
              7       49
              7       49
             10      100
              8       64
              9       81
             11      121
    Sums     90      828

The mean

$$\bar{X} = \frac{\sum X}{n} = \frac{90}{10} = 9.0$$

The standard deviation

$$s = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n-1}} = \sqrt{\frac{828 - \frac{90^2}{10}}{9}} = \sqrt{2} = 1.414$$

Statisticians often speak in terms of the variance rather than standard 
deviation. The variance is simply the square of the standard deviation. 
The population variance is symbolized by $\sigma^2$ and the sample estimate of
the variance by $s^2$.
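
The hand computation for the ten diameters can be mirrored in a few lines. This is a sketch of the sum-of-squares form of the calculation for a simple random sample; the function names are illustrative.

```python
import math

def sample_variance(values):
    """Sample variance from the sum of squares and the squared sum."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(v * v for v in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

def sample_std_dev(values):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(values))

diameters = [9, 9, 11, 9, 7, 7, 10, 8, 9, 11]   # inches
print(sample_variance(diameters))                # 2.0
print(round(sample_std_dev(diameters), 3))       # 1.414
```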

Using the sample range to estimate the standard deviation.—The standard
deviation of the sample is an estimate of the standard deviation ($\sigma$) of the
population. The sample range (R) may also be used to estimate the
population standard deviation. Table 1 (Appendix, p. 76)¹ shows the
ratio of the population standard deviation to the range for simple random
samples of various sizes. In the example we've been using, the range is
11 - 7 = 4. For a sample of size 10, the table gives the value of the ratio
$\sigma/R$ as 0.325. Therefore, $\hat{\sigma}/4 = 0.325$ and $\hat{\sigma} = 1.3$ is an estimate of the true
population standard deviation. Though easy to compute, this is an
efficient estimator of $\sigma$ only for very small samples (say less than 7
observations).
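
A brief sketch of the range method for the same ten diameters. The ratio 0.325 is the tabled value quoted in the text for samples of size 10; ratios for other sample sizes would have to be looked up, and the function name is illustrative.

```python
def std_dev_from_range(values, ratio):
    """Estimate sigma as (tabled ratio of sigma to range) times the sample range."""
    sample_range = max(values) - min(values)
    return ratio * sample_range

diameters = [9, 9, 11, 9, 7, 7, 10, 8, 9, 11]   # inches
# 0.325 is the ratio quoted in the text for n = 10.
print(round(std_dev_from_range(diameters, 0.325), 2))   # 1.3
```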


Coefficient of Variation 


In nature, populations with large means often show more variation than 
populations with small means. The coefficient of variation (C) facilitates 
comparison of variability about different sized means. It is the ratio of 
the standard deviation to the mean. A standard deviation of 2 for a 
mean of 10 indicates the same relative variability as a standard deviation 
of 16 for a mean of 80. The coefficient of variation would be 0.20 or 20 
percent in each case. 

In the problem discussed in the previous section the coefficient of variation
would be estimated by

$$C = \frac{s}{\bar{X}} = \frac{1.414}{9.0} = 0.157, \text{ or } 15.7 \text{ percent}$$
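
The same ratio in code, as a small sketch. It reproduces the 15.7 percent figure and the two cases given earlier in this section (standard deviation 2 with mean 10, and 16 with mean 80); the function name is illustrative.

```python
def coefficient_of_variation(std_dev, mean):
    """Ratio of the standard deviation to the mean."""
    return std_dev / mean

print(round(coefficient_of_variation(1.414, 9.0), 3))   # 0.157
print(coefficient_of_variation(2, 10))                   # 0.2
print(coefficient_of_variation(16, 80))                  # 0.2
```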


Standard Error of the Mean 


There is usually variation among the individual units of a population. 
The standard deviation is a measure of this variation. 

Since the individual units vary, variation may also exist among the 
means (or any other estimates) computed from samples of these units. 
Take, for example, a population with a true mean of 10. If we were to 
select four units at random, they might have a sample mean of 8. Another
sample of four units from the same population might have a mean
of 11, another 10.5, and so forth. Clearly it would be desirable to know
the variation likely to be encountered among the means of samples from
this population. A measure of the variation among sample means is the
standard error of the mean. It can be thought of as a standard deviation
among sample means; it is a measure of the variation among sample
means, just as the standard deviation is a measure of the variation among
individuals. As will be described in the section on simple random sampling,
the standard error of the mean may be used to compute confidence
limits for a population mean.

¹ All tables referred to are in Appendix.

The computation of the standard error of the mean (often symbolized
by $s_{\bar{x}}$) depends on the manner in which the sample was selected. For
simple random sampling without replacement (i.e., a given unit cannot
appear in the sample more than once) from a population having a total of
N units the formula for the estimated standard error of the mean is

$$s_{\bar{x}} = \sqrt{\frac{s^2}{n}\left(1 - \frac{n}{N}\right)}$$

In the problem discussed on page 4 we had n = 10 and found that
s = 1.414 or $s^2$ = 2. If the population contained 1,000 trees, the estimated
mean diameter ($\bar{X}$ = 9.0 inches) would have a standard error of

$$s_{\bar{x}} = \sqrt{\frac{2}{10}\left(1 - \frac{10}{1{,}000}\right)} = \sqrt{0.198} = 0.445$$

The term $\left(1 - \frac{n}{N}\right)$ is called the finite population correction or fpc. If
sampling is with replacement (not too common) or if the sampling fraction
$\left(\frac{n}{N}\right)$ is very small (say less than 1/20), the fpc may be omitted and the
standard error of the mean for a simple random sample is simply

$$s_{\bar{x}} = \sqrt{\frac{s^2}{n}}$$

The variance of the sample mean is simply the square of the standard
error of the mean.
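
A short sketch of the standard error computation for simple random sampling, with and without the finite population correction. The function name and its optional argument are illustrative; the numbers are those of the worked example (variance 2, n = 10, N = 1,000).

```python
import math

def standard_error_of_mean(variance, n, population_size=None):
    """Standard error of the mean for a simple random sample.

    If population_size is given, the finite population correction
    (1 - n/N) is applied; otherwise it is omitted.
    """
    fpc = 1.0 if population_size is None else 1.0 - n / population_size
    return math.sqrt(variance / n * fpc)

print(round(standard_error_of_mean(2, 10, 1000), 3))   # 0.445
print(round(standard_error_of_mean(2, 10), 3))         # 0.447
```

With a sampling fraction of only 1/100, dropping the correction changes the result very little, as the two printed values suggest.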





Covariance 


Very often, each unit of a population will have more than а single 
characteristic. Trees, for example, may be characterized by their height, 
diameter, and form class, The covariance is a measure of the association 
between the magnitudes of two characteristics. If there is little or no 
association, the covariance will be close to zero. If the large values of 
one characteristic tend to be associated with the small values of another 
charac’ the covariance. will be negative. If the large values of 
one characteristic tend to be associated with the large values of another 
characteristic, the covariance will be positive. The population covari- 
ance of X and Y is often symbolized by с zy; the sample estimate by szy. 

Suppose that the diameter (inches) and age (years) have been obtained 
for a number of randomly selected trees. If we symbolize diameter by 
Y and age by X, the sample covariance of diameter and age is given by

$$s_{xy} = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{n}}{n - 1}$$

This is equivalent to the formula

$$s_{xy} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n - 1}$$

If n = 12 and the Y and X values were as follows:

                                                              Sums
    Y     4   9   7   7   5  10   9   6   8   6   4  11        86
    X    20  40  30  45  25  45  30  40  20  35  25  40       395

then

$$s_{xy} = \frac{(4)(20) + (9)(40) + \dots + (11)(40) - \frac{(395)(86)}{12}}{12 - 1} = \frac{2,960 - 2,830.83}{11} = 11.74$$

The positive covariance is consistent with the well known and
economically unfortunate fact that the larger diameters tend to be associated
with the older ages.
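
The computing formula needs only three sums and the sample size. As a minimal Python sketch (the function name `sample_covariance_from_sums` is an illustrative assumption, not something from the handbook), the diameter-age example works out as follows, using the sums quoted in the text: $\sum XY = 2,960$, $\sum X = 395$, $\sum Y = 86$, n = 12.

```python
def sample_covariance_from_sums(sum_xy, sum_x, sum_y, n):
    """Sample covariance by the computing formula.

    sum_xy is the uncorrected sum of XY products; sum_x and sum_y are the
    plain sums of X and Y; n is the number of paired observations.
    """
    corrected_products = sum_xy - sum_x * sum_y / n
    return corrected_products / (n - 1)

# Sums from the diameter (Y) and age (X) example.
s_xy = sample_covariance_from_sums(sum_xy=2960, sum_x=395, sum_y=86, n=12)
print(round(s_xy, 2))
```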


Simple Correlation Coefficient 


The magnitude of the covariance, like that of the standard deviation, 
is often related to the size of the variables themselves. Units with large 
X and Y values tend to have larger covariances than units with small X 
and Y values. Also, the magnitude of the covariance depends on the 
scale of measurement; in the previous example, had diameter been ex- 
pressed in millimeters instead of inches, the covariance would have been 
298.196 instead of 11.74. 

The simple correlation coefficient, a measure of the degree of linear
association between two variables, is free of the effects of scale of
measurement. It can vary from -1 to +1. A correlation of 0 indicates that
there is no linear association (there may be a very strong nonlinear
association, however). A correlation of +1 or -1 would suggest a perfect
linear association. As for the covariance, a positive correlation implies
that the large values of X are associated with the large values of Y. If
the large values of X are associated with the small values of Y, the
correlation is negative.

The population correlation coefficient is commonly symbolized by $\rho$
(rho), and the sample-based estimate by r. The population correlation
coefficient is defined to be

$$\rho = \frac{\text{Covariance of X and Y}}{\sqrt{(\text{Variance of X})(\text{Variance of Y})}}$$

For a simple random sample, the sample correlation coefficient is
computed as follows:

$$r = \frac{s_{xy}}{s_x s_y} = \frac{\sum xy}{\sqrt{(\sum x^2)(\sum y^2)}}$$





where: $s_{xy}$ = Sample covariance of X and Y
       $s_x$ = Sample standard deviation of X
       $s_y$ = Sample standard deviation of Y
       $\sum xy$ = Corrected sum of XY products
                 $= \sum XY - \frac{(\sum X)(\sum Y)}{n}$
       $\sum x^2$ = Corrected sum of squares for X
                  $= \sum X^2 - \frac{(\sum X)^2}{n}$
       $\sum y^2$ = Corrected sum of squares for Y
                  $= \sum Y^2 - \frac{(\sum Y)^2}{n}$


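For the values used to illustrate the covariance we have:

$$\sum xy = (4)(20) + (9)(40) + \dots + (11)(40) - \frac{(395)(86)}{12} = 129.1667$$

$$\sum y^2 = 4^2 + 9^2 + \dots + 11^2 - \frac{86^2}{12} = 57.6667$$

$$\sum x^2 = 20^2 + 40^2 + \dots + 40^2 - \frac{395^2}{12} = 922.9167$$

$$r = \frac{129.1667}{\sqrt{(57.6667)(922.9167)}} = \frac{129.1667}{230.6980} = 0.56$$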
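
A short Python sketch of this computation may help; it is an illustration only, and the function name `correlation_from_corrected_sums` is an assumption. It takes the three corrected sums quoted above and also shows that changing the scale of Y (inches to millimeters, a factor of 25.4) leaves r unchanged.

```python
import math

def correlation_from_corrected_sums(sum_xy, sum_x2, sum_y2):
    """Sample correlation r from corrected sums of products and squares."""
    return sum_xy / math.sqrt(sum_x2 * sum_y2)

# Corrected sums from the diameter (Y) and age (X) example.
r = correlation_from_corrected_sums(129.1667, 922.9167, 57.6667)

# Rescaling Y by a constant multiplies sum_xy by the constant and
# sum_y2 by its square, so the ratio is unaffected.
k = 25.4  # inches to millimeters
r_mm = correlation_from_corrected_sums(129.1667 * k, 922.9167, 57.6667 * k ** 2)
print(round(r, 2), round(r_mm, 2))
```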





Correlation or chance. The computed value of a statistic such as the
correlation coefficient depends on which particular units were selected
for the sample. Such estimates will vary from sample to sample. More
important, they will usually vary from the population value which we try
to estimate.

In the above example, the sample correlation coefficient was 0.56.
Does this mean that there is a real linear association between Y and X?
Or could we get a value as large as this just by chance when sampling a
population in which there is no linear association between Y and X (i.e.,
a population for which $\rho = 0$)?

This can be tested by referring to table 7 (App.). The column headed
"Degrees of freedom" refers to the sample size. A correlation coefficient
estimated from a simple random sample of n units will have (n - 2)
degrees of freedom. Looking in the row for 10 degrees of freedom we find
in the column headed "5%" a value of 0.576. This says that in sampling
from a population for which $\rho = 0$ we would get a sample value as large as
0.576 just by chance about 5 percent of the time. Sample values smaller
than 0.576 could occur more often than this. Thus we might conclude
that our sample r = 0.56 could have been obtained by chance in sampling
from a population with a true correlation of zero.

This test result is usually summarized by saying that the sample
correlation coefficient is not significant at the 0.05 level. In statistical
terms, we tested the hypothesis that $\rho = 0$ and failed to reject the
hypothesis at the 0.05 level. This is not exactly the same as saying that we
accept the hypothesis or that we have proved that $\rho = 0$. The distinction
is subtle but real.




For a sample correlation larger than 0.576 we might decide that the
departure from a value of zero is larger than we would expect by chance.
Statistically we would reject the hypothesis that $\rho = 0$.


Variance of a Linear Function 


Quite often we will want to combine variables or population estimates
in a linear function. For example, if the mean timber volume per acre
has been estimated as $\bar{X}$, then the total volume on M acres will be $M\bar{X}$;
the estimate of total volume is a linear function of the estimated mean
volume. If the estimate of cubic volume per acre in sawtimber is $\bar{X}_1$ and
of pulpwood above the sawtimber top is $\bar{X}_2$, then the estimate of total
cubic foot volume per acre is $\bar{X}_1 + \bar{X}_2$. If on a given tract the mean
volume per half-acre is $\bar{X}_1$ for spruce and the mean volume per
quarter-acre is $\bar{X}_2$ for yellow birch, then the estimated total volume per
acre of spruce and birch would be $2\bar{X}_1 + 4\bar{X}_2$.

In general terms, a linear function of three variables (say $X_1$, $X_2$, and
$X_3$) can be written as

$$L = a_1X_1 + a_2X_2 + a_3X_3$$

where $a_1$, $a_2$, and $a_3$ are constants.
If the variances are $s_1^2$, $s_2^2$, and $s_3^2$ (for $X_1$, $X_2$, and $X_3$ respectively) and
the covariances are $s_{1,2}$, $s_{1,3}$, and $s_{2,3}$, then the variance of L is given by

$$s_L^2 = a_1^2s_1^2 + a_2^2s_2^2 + a_3^2s_3^2 + 2(a_1a_2s_{1,2} + a_1a_3s_{1,3} + a_2a_3s_{2,3})$$

The standard deviation (or standard error) of L is simply the square root
of this.

The extension of the rule to cover any number of variables should be 
fairly obvious. 

Some examples. I. The sample mean volume per acre for a 10,000-acre
tract is $\bar{X} = 5,680$ board feet with a standard error of $s_{\bar{x}} = 632$ (so
$s_{\bar{x}}^2 = 399,424$). The estimated total volume is

$$L = 10,000(\bar{X}) = 56,800,000 \text{ board feet.}$$

The variance of this estimate would be

$$s_L^2 = (10,000)^2(s_{\bar{x}}^2) = 39,942,400,000,000$$

Since the standard error of an estimate is the square root of its variance,
the standard error of the estimated total is

$$s_L = \sqrt{s_L^2} = 6,320,000$$


II. In 1955 a random sample of 40 one-quarter-acre circular plots was
used to estimate the cubic foot volume of a stand of pine. Plot centers
were monumented for possible relocation at a later time. The mean
volume per plot was $\bar{X}_1 = 225$ cubic feet. The plot variance was
$s_1^2 = 8,281$, so that the variance of the mean was
$s_{\bar{x}_1}^2 = 8,281/40 = 207.025$.

In 1960 a second inventory was made using the same plot centers.
This time, however, the circular plots were only one-tenth acre. The
mean volume per plot was $\bar{X}_2 = 122$ cubic feet. The plot variance was
$s_2^2 = 6,084$, so the variance of the mean was $s_{\bar{x}_2}^2 = 152.100$. The
covariance of initial and final plot volumes was $s_{1,2} = 4,259$, making the
covariance of the means $s_{\bar{x}_1,\bar{x}_2} = 4,259/40 = 106.475$.

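The net periodic growth per acre would be estimated as

$$G = 10\bar{X}_2 - 4\bar{X}_1 = 10(122) - 4(225) = 320 \text{ cubic feet per acre.}$$

By the rule for linear functions the variance of G would be

$$\begin{aligned}
s_G^2 &= (10)^2s_{\bar{x}_2}^2 + (-4)^2s_{\bar{x}_1}^2 + 2(10)(-4)s_{\bar{x}_1,\bar{x}_2} \\
      &= 100(152.100) + 16(207.025) - 80(106.475) \\
      &= 10,004.4
\end{aligned}$$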


In this example there was a statistical relationship between the 1960
and 1955 means because the same plot locations were used in both
samples. The covariance of the means ($s_{\bar{x}_1,\bar{x}_2}$) is a measure of this
relationship. If the 1960 plots had been located at random rather than at
the 1955 locations, the two means would have been considered statistically
independent and their covariance would have been set at zero. In this
case the equation for the variance of the net periodic growth per acre (G)
would reduce to

$$\begin{aligned}
s_G^2 &= (10)^2s_{\bar{x}_2}^2 + (-4)^2s_{\bar{x}_1}^2 \\
      &= 100(152.100) + 16(207.025) = 18,522.4
\end{aligned}$$
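
The rule extends to any number of variables if the variances and covariances are arranged in a square table. The Python sketch below is one possible way to write it; the function name `variance_of_linear_function` and the matrix layout are illustrative assumptions, not the handbook's notation. It reworks examples I and II, including the case of independent samples.

```python
import math

def variance_of_linear_function(coefs, cov):
    """Variance of L = a_1*X_1 + ... + a_k*X_k.

    coefs: the constants a_i.
    cov:   k-by-k symmetric table with variances on the diagonal and
           covariances off the diagonal.
    Summing a_i*a_j*cov[i][j] over all i, j counts each covariance twice,
    which supplies the factor of 2 in the rule.
    """
    k = len(coefs)
    return sum(coefs[i] * coefs[j] * cov[i][j]
               for i in range(k) for j in range(k))

# Example I: total = 10,000 * mean, variance of the mean = 399,424.
se_total = math.sqrt(variance_of_linear_function([10_000], [[399_424]]))

# Example II: G = 10*X2 - 4*X1, variables ordered (X2, X1).
cov_same_plots = [[152.100, 106.475],
                  [106.475, 207.025]]
var_growth = variance_of_linear_function([10, -4], cov_same_plots)

# Independent samples: the covariance of the means is set to zero.
cov_independent = [[152.100, 0.0],
                   [0.0, 207.025]]
var_growth_independent = variance_of_linear_function([10, -4], cov_independent)

print(se_total, round(var_growth, 1), round(var_growth_independent, 1))
```

Remeasuring the same plots nearly halves the variance of the growth estimate in this example, because the positive covariance enters with a negative sign.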







SAMPLING—MEASUREMENT VARIABLES 
Simple Random Sampling 


Most foresters are familiar with simple random sampling. As in any
sampling system, the aim is to estimate some characteristic of a
population without measuring all of the population units. In a simple random
sample of size n, the units are selected so that every possible combination
of n units has an equal chance of being selected. If sampling is with
replacement, then at each stage of the sampling all units should have an
equal chance of being selected. If sampling is without replacement, then
at any stage of the sampling each unused unit should have an equal
chance of being selected.


Sample estimates of the population mean and total. From a population
of N = 100 units, n = 20 units were selected at random and measured.
Sampling was without replacement: once a unit had been included in the
sample it could not be selected again. The unit values were:

    10    9   10    9   11
    16   11    7   12   12
    11    3    5   11   14
     8   13   12   20   10

Sum of all 20 random units = 214

From this sample we estimate the population mean as

$$\bar{X} = \frac{\sum X}{n} = \frac{214}{20} = 10.7$$

A population of N = 100 units having a mean of 10.7 would then have an
estimated total of

$$\hat{T} = N\bar{X} = 100(10.7) = 1,070$$


Standard Errors 


The first step in calculating a standard error is to obtain an estimate of
the population variance ($\sigma^2$) or standard deviation ($\sigma$). As noted in a
previous section, the standard deviation for a simple random sample is
estimated by

$$s = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1}} = \sqrt{\frac{10^2 + 9^2 + \dots + 10^2 - \frac{214^2}{20}}{19}} = \sqrt{13.4842} = 3.672$$

For sampling without replacement, the standard error of the mean is

$$s_{\bar{x}} = \sqrt{\frac{s^2}{n}\left(1 - \frac{n}{N}\right)} = \sqrt{\frac{13.4842}{20}\left(1 - \frac{20}{100}\right)} = \sqrt{0.539368} = 0.734$$

From the formula for the variance of a linear function we find that the
variance of the estimated total is

$$s_{\hat{T}}^2 = N^2s_{\bar{x}}^2$$

The standard error of the estimated total is the square root of this, or

$$s_{\hat{T}} = Ns_{\bar{x}} = 100(0.734) = 73.4$$
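
The whole chain, from unit values to the standard error of the total, can be traced in a few lines of Python. This is a sketch for checking the arithmetic, not part of the handbook; the variable names are assumptions, and the 20 unit values are entered so that they agree with the reported sum of 214 and variance of 13.4842.

```python
import math
import statistics

units = [10, 9, 10, 9, 11,
         16, 11, 7, 12, 12,
         11, 3, 5, 11, 14,
         8, 13, 12, 20, 10]
N = 100                     # population size
n = len(units)              # sample size, 20

mean = sum(units) / n                        # estimated population mean
total = N * mean                             # estimated population total
s2 = statistics.variance(units)              # sample variance, divisor n - 1
s = math.sqrt(s2)                            # standard deviation
se_mean = math.sqrt(s2 / n * (1 - n / N))    # with the fpc
se_total = N * se_mean

print(mean, total, round(s2, 4), round(s, 3), round(se_mean, 3), round(se_total, 1))
```

The unrounded standard error of the total comes to about 73.44; the handbook's 73.4 is obtained from the rounded 0.734.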





Confidence Limits 


We have it on good authority that "you can fool all of the people some
of the time." The oldest and simplest device for misleading folks is the
barefaced lie. A method that is nearly as effective and far more subtle is
to report a sample estimate without any indication of its reliability.

Sample estimates are subject to variation. How much they vary
depends primarily on the inherent variability of the population ($\sigma^2$) and on
the size of the sample (n) and of the population (N).

The statistical way of indicating the reliability of an estimate is to
establish confidence limits. For estimates made from normally
distributed populations, the confidence limits are given by

    (Estimate) ± (t)(Standard Error)


For setting confidence limits on the mean and total we already have
everything we need except for the value of t, and that can be obtained
from the table of the t distribution (table 2 in the appendix). In this
table, the column headed df (degrees of freedom) refers to the size of the
sample. For the mean (or total) of a simple random sample we would
select a t value with (n - 1) degrees of freedom. The columns labeled
"Probability" refer to the kind of odds we demand. If we want to say
that the true mean (or total) falls within certain limits unless a
one-in-twenty chance has occurred, we use the t value in the column headed .05.
If we want to say that the true value lies within a set of limits unless a
one-in-one hundred chance has occurred, we select t from the column
headed .01.

In the previous example the sample of n = 20 units had a mean of
$\bar{X} = 10.7$ and a standard error of $s_{\bar{x}} = 0.734$. For 95-percent confidence
limits on the mean we would use a t value from the .05 column and the
row for 19 degrees of freedom. As $t_{.05} = 2.093$, the confidence limits are
given by

$$\bar{X} \pm (t)(s_{\bar{x}}) = 10.7 \pm (2.093)(0.734) = 9.16 \text{ to } 12.24$$

This says that unless a one-in-twenty chance has occurred in sampling,
the population mean is somewhere between 9.16 and 12.24. It does not
say where the mean of future samples from this population might fall.
Nor does it say where the mean may be if mistakes have been made in the
measurements.

For 99-percent confidence limits we find $t_{.01} = 2.861$ (with 19 degrees of
freedom), and so the limits are

$$10.7 \pm (2.861)(0.734) = 8.6 \text{ to } 12.8$$

These limits are wider, but they are more likely to include the true
population mean.
For the population total the confidence limits are:

    95-percent limits: 1,070 ± (2.093)(73.4) = 916 to 1,224
    99-percent limits: 1,070 ± (2.861)(73.4) = 860 to 1,280
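
The four sets of limits above all come from one expression. A minimal Python sketch follows; the helper name `confidence_limits` is an assumption, and the t values are simply the tabular figures quoted in the text (19 degrees of freedom), entered by hand rather than computed.

```python
def confidence_limits(estimate, standard_error, t):
    """Return (lower, upper) = estimate -/+ t * standard_error."""
    half_width = t * standard_error
    return estimate - half_width, estimate + half_width

T_05 = 2.093   # tabular t, 19 df, .05 column
T_01 = 2.861   # tabular t, 19 df, .01 column

mean_95 = confidence_limits(10.7, 0.734, T_05)
mean_99 = confidence_limits(10.7, 0.734, T_01)
total_95 = confidence_limits(1070, 73.4, T_05)
total_99 = confidence_limits(1070, 73.4, T_01)

for label, limits in [("mean 95%", mean_95), ("mean 99%", mean_99),
                      ("total 95%", total_95), ("total 99%", total_99)]:
    print(label, [round(v, 2) for v in limits])
```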


For large samples (n > 60) the 95-percent limits are closely approximated
by

    Estimate ± (2)(Standard Error)

and the 99-percent limits by

    Estimate ± (2.6)(Standard Error)

Sample size 


Samples cost money. So do errors. The aim in planning a survey 
should be to take enough observations to obtain the desired precision— 
no more, no less. 

The number of observations needed in a simple random sample will
depend on the precision desired and the inherent variability of the
population being sampled. Since sampling precision is often expressed in
terms of the confidence interval on the mean, it is not unreasonable in
planning a survey to say that in the computed confidence interval

$$\bar{X} \pm ts_{\bar{x}}$$

we would like to have the $ts_{\bar{x}}$ equal to or less than some specified value E,
unless a one-in-twenty (or one-in-one hundred) chance has occurred in
sampling. That is, we want

$$ts_{\bar{x}} = E$$

or, since $s_{\bar{x}} = s/\sqrt{n}$, we want

$$t\left(\frac{s}{\sqrt{n}}\right) = E$$

Solving this for n gives the desired sample size.

$$n = \frac{t^2s^2}{E^2}$$

To apply this equation we need to have an estimate ($s^2$) of the population
variance and a value for Student's t at the appropriate level of probability.
The variance estimate can be a real problem. One solution is to make
the sample survey in two stages. In the first stage, $n_1$ random
observations are made and from these an estimate ($s^2$) of the variance is
computed. Then this value is plugged into the sample size equation

$$n = \frac{t^2s^2}{E^2}$$

where t has $n_1 - 1$ degrees of freedom and is selected from table 2 of the
appendix. The computed value of n is the total size of sample needed.
As we have already observed $n_1$ units, this means that we will have to
observe $(n - n_1)$ additional units.

If pre-sampling as described above is not feasible then it will be
necessary to make a guess at the variance. Assuming our knowledge of the
population is such that the guessed variance ($s^2$) can be considered fairly
reliable, then the size of sample (n) needed to estimate the mean to within
±E units is approximately

$$n = \frac{4s^2}{E^2}$$

for 95 percent confidence and

$$n = \frac{20s^2}{3E^2}$$

for 99 percent confidence.

Less reliable variance estimates could be doubled (as a safety factor) 
before applying these equations. In many cases the variance estimate 
may be so poor as to make the sample size computation just so much 
statistical window dressing. 

When sampling is without replacement (as it is in most forest sampling
situations) the sample size estimates given above apply to populations
with an extremely large number (N) of units so that the sampling fraction
(n/N) is very small. If the sampling fraction is not small (say n/N > .05)
then the sample size estimates should be adjusted. This adjusted value
of n is

$$n_a = \frac{n}{1 + \frac{n}{N}}$$


Warning! It is important that the specified error (E) and the estimated
variance ($s^2$) be on the same scale of measurement. We could not, for
example, use a board-foot variance in conjunction with an error expressed
in cubic feet. Similarly, if the error is expressed in volume per acre, the
variance must be put on a per-acre basis.

Suppose that we plan to use quarter-acre plots in a survey and estimate
the variance among plot volumes to be $s^2 = 160,000$. If the error limit is
E = 500 feet per acre, we must convert the variance to an acre basis or the
error to a quarter-acre basis. To convert a quarter-acre volume to a
per-acre basis we multiply by 4, and to convert a quarter-acre variance
to an acre variance we multiply by 16. Thus, the variance would be
2,560,000 and the sample-size formula would be

$$n = \frac{4(2,560,000)}{(500)^2} = 40.96 \text{ (or 41)}$$

Alternatively, we can leave the variance alone and convert the error
statement from an acre to a quarter-acre basis; i.e., E = 125. Then the
sample-size formula is

$$n = \frac{4(160,000)}{(125)^2} = 40.96 \text{ (or 41), as before.}$$

The problem of units of measure is not difficult, but the unwary can
easily go astray.
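
How far astray is easy to demonstrate. In the Python sketch below (names are illustrative assumptions), the quarter-acre example is worked on both consistent bases and then once with the units mixed, which is the mistake the warning is about.

```python
def approx_sample_size_95(s2, E):
    """Approximate n for 95-percent confidence: 4 * s^2 / E^2."""
    return 4 * s2 / E ** 2

plot_variance = 160_000    # variance among quarter-acre plot volumes
plots_per_acre = 4
E_per_acre = 500           # allowable error, per-acre basis

# Consistent: both on a per-acre basis (variance scales by 4^2 = 16).
n_acre_basis = approx_sample_size_95(plot_variance * plots_per_acre ** 2, E_per_acre)

# Consistent: both on a quarter-acre basis (error scales by 1/4).
n_plot_basis = approx_sample_size_95(plot_variance, E_per_acre / plots_per_acre)

# Inconsistent: quarter-acre variance with a per-acre error.
n_mixed_units = approx_sample_size_95(plot_variance, E_per_acre)

print(n_acre_basis, n_plot_basis, n_mixed_units)
```

The mixed-unit figure is too small by a factor of 16, so a survey planned from it would fall well short of the precision asked for.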




Stratified Random Sampling 


In stratified sampling, a population is divided into subpopulations
(strata) of known size, and a simple random sample of at least two units
is selected in each subpopulation. This approach has several advantages.
For one thing, if there is more variation between subpopulations than
within them, the estimate of the population mean will be more precise
than that given by a simple random sample of the same size. Also, it
may be desirable to have separate estimates for each subpopulation (e.g.,
in timber types or administrative subunits). And it may be
administratively more efficient to sample by subpopulations.


Example:

A 500-acre forested area was divided into three strata on the basis of
timber type. A simple random sample of 0.2-acre plots was taken in
each stratum, and the means, variances, and standard errors were
computed by the formulae for a simple random sample. These results, along
with the size ($N_h$) of each stratum (expressed in number of 0.2-acre plots),
are:

                                                                         Squared
                                                              Within-    standard
                           Stratum  Stratum  Sample  Stratum  stratum    error of
    Type                   number   size     size    mean     variance   the mean
                           (h)      ($N_h$)  ($n_h$) ($\bar{X}_h$) ($s_h^2$) ($s_{\bar{x}_h}^2$)

    Pine                     1       1,350     30      251     10,860     353.96
    Upland hardwoods         2         700     15      164      9,680     631.50
    Bottom-land hardwoods    3         450     10      110      3,020     295.29

    Sum                              2,500     55


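The squared standard error of the mean for stratum h is computed by
the formula given for the simple random sample

$$s_{\bar{x}_h}^2 = \frac{s_h^2}{n_h}\left(1 - \frac{n_h}{N_h}\right)$$

Thus, for stratum 1 (pine type),

$$s_{\bar{x}_1}^2 = \frac{10,860}{30}\left(1 - \frac{30}{1,350}\right) = 353.96$$

Where the sampling fraction ($n_h/N_h$) is small, the fpc can be omitted.
With these data, the population mean is estimated by

$$\bar{X}_{st} = \frac{\sum N_h\bar{X}_h}{N}$$

where $N = \sum N_h$.
For this example we have

$$\bar{X}_{st} = \frac{N_1\bar{X}_1 + N_2\bar{X}_2 + N_3\bar{X}_3}{N} = \frac{1,350(251) + 700(164) + 450(110)}{2,500} = 201.26$$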



















The formula for the standard error of the stratified mean is cumbersome
but not complicated.

$$s_{\bar{x}_{st}} = \sqrt{\frac{1}{N^2}\sum\left[N_h^2s_{\bar{x}_h}^2\right]}$$

$$= \sqrt{\frac{(1,350)^2(353.96) + (700)^2(631.50) + (450)^2(295.29)}{(2,500)^2}} = 12.74$$

If the sample size is fairly large, the confidence limits on the mean are
given by

    95-percent confidence limits = $\bar{X}_{st} \pm 2s_{\bar{x}_{st}}$

    99-percent confidence limits = $\bar{X}_{st} \pm 2.6s_{\bar{x}_{st}}$

There is no simple way of computing the confidence limits for small
samples.
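
The stratified estimates can be reproduced with a short Python sketch. It is offered only as a check on the arithmetic; the tuple layout and variable names are assumptions, and the stratum figures are those of the 500-acre example (with the bottom-land hardwood variance taken as 3,020).

```python
import math

# (stratum size N_h, sample size n_h, stratum mean, within-stratum variance)
strata = [
    (1350, 30, 251, 10860),   # pine
    (700, 15, 164, 9680),     # upland hardwoods
    (450, 10, 110, 3020),     # bottom-land hardwoods
]

N = sum(N_h for N_h, _, _, _ in strata)

# Stratified mean: stratum means weighted by stratum size.
mean_st = sum(N_h * mean_h for N_h, _, mean_h, _ in strata) / N

# Squared standard error of each stratum mean, fpc included.
sq_se = [s2_h / n_h * (1 - n_h / N_h) for N_h, n_h, _, s2_h in strata]

# Standard error of the stratified mean.
var_st = sum(N_h ** 2 * v for (N_h, _, _, _), v in zip(strata, sq_se)) / N ** 2
se_st = math.sqrt(var_st)

# Large-sample 95-percent limits (multiplier of 2).
limits_95 = (mean_st - 2 * se_st, mean_st + 2 * se_st)

print(round(mean_st, 2), [round(v, 2) for v in sq_se], round(se_st, 2))
```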


Sample allocations 


If a sample of n units is taken, how many units should be selected in
each stratum? Among several possibilities, the most common procedure
is to allocate the sample in proportion to the size of the stratum; in a
stratum having two-fifths of the units of the population we would take
two-fifths of the samples. In the population discussed in the previous
example the proportional allocation of the 55 sample units would have
been (and was) as follows:

                   Relative            Sample
    Stratum     size ($N_h/N$)       allocation
       1            0.54             29.7 or 30
       2            0.28             15.4 or 15
       3            0.18              9.9 or 10
    Sums            1.00                 55

For proportional allocation the number of sample units to be selected in
stratum h is

$$n_h = \left(\frac{N_h}{N}\right)n$$


Some other possibilities are equal allocation, allocation proportional to
estimated value, and optimum allocation. In optimum allocation an
attempt is made to get the smallest standard error (of $\bar{X}_{st}$) possible for a
sample of n units. This is done by sampling more heavily in the strata
having a larger variation. The equation for optimum allocation is

$$n_h = \left(\frac{N_hs_h}{\sum N_hs_h}\right)n$$

Optimum allocation obviously requires estimates of the within-stratum
variances, information that may be difficult to obtain.

A refinement of optimum allocation is to take sampling cost differences
into account and allocate the sample so as to get the most information per
dollar. If the cost per sampling unit in stratum h is $c_h$, the equation is

$$n_h = \left(\frac{N_hs_h/\sqrt{c_h}}{\sum\left(N_hs_h/\sqrt{c_h}\right)}\right)n$$
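
All three allocation rules share one pattern: each stratum gets a share of n proportional to some weight. The Python sketch below illustrates this under stated assumptions. The function name `allocate` is invented here; the stratum variances (8,000, 10,000, and 5,000) are the prior estimates used in the sample-size illustration of this section; and the per-plot costs are purely hypothetical.

```python
import math

def allocate(n, weights):
    """Split n sample units among strata in proportion to the weights."""
    total = sum(weights)
    return [n * w / total for w in weights]

n = 55
N_h = [1350, 700, 450]                                 # stratum sizes
s_h = [math.sqrt(v) for v in (8000, 10000, 5000)]      # stratum standard deviations
c_h = [1.0, 1.0, 4.0]                                  # hypothetical cost per plot

proportional = allocate(n, N_h)
optimum = allocate(n, [N * s for N, s in zip(N_h, s_h)])
cost_weighted = allocate(n, [N * s / math.sqrt(c) for N, s, c in zip(N_h, s_h, c_h)])

for name, alloc in [("proportional", proportional), ("optimum", optimum),
                    ("cost-weighted", cost_weighted)]:
    print(name, [round(v, 1) for v in alloc])
```

In this sketch the fractional allocations are left unrounded; in practice they would be rounded to whole plots, keeping at least two per stratum.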


Sample size


To estimate the size of sample to take for a specified error at a given
level of confidence, it is first necessary to decide on the method of
allocation. Ordinarily, proportional allocation is the simplest and perhaps
the best choice. With proportional allocation, the size of sample needed
to be within ±E units of the true value at the 0.05 probability level can
be approximated by

$$n = \frac{N\left(\sum N_hs_h^2\right)}{\frac{N^2E^2}{4} + \sum N_hs_h^2}$$

For the 0.01 probability level, use 6.76 in place of 4.

To illustrate, assume that prior to sampling the 500-acre forest, we had
decided that we wish to estimate the mean volume per acre to within
±100 cubic feet per acre unless a 1-in-20 chance occurs in sampling. As
we plan to sample with 0.2-acre plots, the error specification should be
put on a 0.2-acre basis. Therefore,

    E = 20

From previous sampling the stratum variances for 0.2-acre volumes are
estimated to be

    $s_1^2 = 8,000$     $s_2^2 = 10,000$     $s_3^2 = 5,000$

The stratum sizes are known to be as previously shown

    $N_1 = 1,350$,  $N_2 = 700$,  $N_3 = 450$,  $N = 2,500$

Therefore,

$$n = \frac{2,500[(1,350)(8,000) + (700)(10,000) + (450)(5,000)]}{\frac{(2,500)^2(20)^2}{4} + [(1,350)(8,000) + (700)(10,000) + (450)(5,000)]} = 77.7, \text{ or } 78$$

The 78 sample units would now be allocated to the strata by the formula

$$n_h = \left(\frac{N_h}{N}\right)n$$

giving $n_1 = 42$, $n_2 = 22$, and $n_3 = 14$.
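
The same computation in Python, as a sketch for checking the figures (the function name `stratified_sample_size` and the `t_squared` argument are assumptions made for this illustration):

```python
import math

def stratified_sample_size(stratum_sizes, stratum_variances, E, t_squared=4.0):
    """Approximate total sample size under proportional allocation.

    t_squared is 4 for the 0.05 probability level and 6.76 for the 0.01 level.
    """
    N = sum(stratum_sizes)
    within = sum(N_h * s2_h for N_h, s2_h in zip(stratum_sizes, stratum_variances))
    return N * within / (N ** 2 * E ** 2 / t_squared + within)

N_h = [1350, 700, 450]
s2_h = [8000, 10000, 5000]      # prior variances, 0.2-acre basis
E = 100 * 0.2                   # 100 cu. ft. per acre put on a 0.2-acre basis

n_exact = stratified_sample_size(N_h, s2_h, E)
n = math.ceil(n_exact)

# Proportional allocation of the n plots, rounded to whole plots.
allocation = [round(n * size / sum(N_h)) for size in N_h]

n_99 = math.ceil(stratified_sample_size(N_h, s2_h, E, t_squared=6.76))
print(round(n_exact, 1), n, allocation, n_99)
```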


SAMPLING— DISCRETE VARIABLES 
Random Sampling 


The sampling methods discussed in the previous sections apply to data
that are on a continuous or nearly continuous scale of measurement. These
methods may not be applicable if each unit observed is classified as alive
or dead, germinated or not germinated, infected or not infected. Data of
this type may follow what is known as the binomial distribution. They
require slightly different statistical techniques.




As an illustration, suppose that a sample of 1,000 seeds was selected at
random and tested for germination. If 480 of the seeds germinated, the
estimated viability for the lot would be

$$\bar{p} = \frac{480}{1,000} = 0.48, \text{ or 48 percent}$$

Confidence limits for the population viability are easily obtained from 
appendix table 5: look in the “fraction observed” column for 0.48, and 
then move crosswise to the column for a sample of size 1,000. The figures 
in this column of the 95-percent side of the table are 45 and 51. Thus, 
unless a one-in-twenty chance has occurred in sampling, the germination 
percent for the population is between 45 and 51. The 99-percent confi- 
dence limits, obtained in the same manner, are 44 and 52. 

If the sample size is n = 10, 15, 20, 30, or 50, it will be necessary to look
in the far left column for the number actually observed (rather than the
fraction observed). Then in the appropriate sample-size column will be
found the confidence limits for the fraction observed. Thus, for a
germination of 24 seeds in a sample of 50 (so $\bar{p} = 0.48$) the 95-percent
confidence limits would be 0.34 and 0.63.

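For large samples (say n > 250) with proportions greater than 0.20 but
less than 0.80, approximate confidence limits can be obtained another
way. First we compute the standard error of $\bar{p}$ by the equation

$$s_{\bar{p}} = \sqrt{\frac{\bar{p}(1 - \bar{p})}{n - 1}\left(1 - \frac{n}{N}\right)}$$

Then, the 95-percent confidence limits are given by

$$\text{95-percent confidence interval} = \bar{p} \pm \left[2(s_{\bar{p}}) + \frac{1}{2n}\right]$$

Applying this to the above example we get

$$s_{\bar{p}} = \sqrt{\frac{(0.48)(0.52)}{999}} \text{ (fpc ignored)} = 0.0158$$

And,

$$\text{95-percent confidence interval} = 0.48 \pm \left[2(0.0158) + \frac{1}{2(1,000)}\right] = 0.448 \text{ to } 0.512$$

The 99-percent confidence limits are approximated by

$$\text{99-percent confidence interval} = \bar{p} \pm \left[2.6s_{\bar{p}} + \frac{1}{2n}\right]$$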
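
A Python sketch of the large-sample approximation follows. It is illustrative only: the function name `proportion_limits` is an assumption, and the divisor n - 1 in the standard error is the reading of the formula adopted for this sketch.

```python
import math

def proportion_limits(p, n, multiplier=2.0, N=None):
    """Approximate limits: p -/+ [multiplier * s_p + 1/(2n)].

    multiplier is 2 for 95-percent and 2.6 for 99-percent limits.
    With N=None the fpc is ignored.
    """
    fpc = 1.0 if N is None else 1.0 - n / N
    se = math.sqrt(p * (1 - p) / (n - 1) * fpc)
    half_width = multiplier * se + 1 / (2 * n)
    return se, p - half_width, p + half_width

# Seed example: 480 of 1,000 seeds germinated.
se, lo95, hi95 = proportion_limits(480 / 1000, 1000)
_, lo99, hi99 = proportion_limits(480 / 1000, 1000, multiplier=2.6)

print(round(se, 4), round(lo95, 3), round(hi95, 3), round(lo99, 3), round(hi99, 3))
```

These approximate limits agree closely with the tabular limits of 45 to 51 percent and 44 to 52 percent quoted earlier.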


Sample size 


Table 5 can also be used to estimate the number of units that would 
have to be observed in a simple random sample in order to estimate a 
population proportion with some specified precision. 

Suppose, for example, that we wanted to estimate the germination 
percent for a population to within plus or minus 10 percent (or 0.10) at 
the 95-percent confidence level. The first step is to guess about what the 
proportion of seed germinating will be. If a good guess is not possible, 
then the safest course is to guess р =0.50 as this will give the maximum 
sample size. 










Next, pick any of the sample sizes given in the table (10, 15, 20, 30, 50,
100, 250, and 1,000) and look at the confidence interval for the specified
value of $\bar{p}$. Inspection of these limits will tell whether or not the precision
will be met with a sample of this size or if a larger or smaller sample would
be more appropriate.

Thus, if we guess $\bar{p} = 0.2$, then in a sample of n = 50 we would expect to
observe (0.2)(50) = 10, and the table says that the 95-percent confidence
limits on $\bar{p}$ would be 0.10 and 0.34. Since the upper limit is not within 0.10
of $\bar{p}$, a larger sample would be needed. For a sample of n = 100 the limits
are 0.13 to 0.29. Since both of these values are within 0.10 of $\bar{p}$, a sample
of 100 would be adequate.

If the table indicates the need for a sample of over 250, the size can be
approximated by

$$n = \frac{4(\bar{p})(1 - \bar{p})}{E^2}, \text{ for 95-percent confidence}$$

or,

$$n = \frac{20(\bar{p})(1 - \bar{p})}{3E^2}, \text{ for 99-percent confidence}$$

where: E = The precision with which $\bar{p}$ is to be estimated (expressed in
           same form as $\bar{p}$, either percent or decimal).
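
As a rough Python sketch of these two approximations (the function name, the rounding up to whole units, and the precision E = 0.03 are all assumptions made for illustration; proportions are entered as decimals):

```python
import math

def sample_size_for_proportion(p, E, confidence=95):
    """Approximate n for estimating a proportion to within +/- E.

    Uses 4*p*(1-p)/E^2 for 95-percent confidence and
    20*p*(1-p)/(3*E^2) for 99-percent confidence; p and E are decimals.
    """
    factor = 4.0 if confidence == 95 else 20.0 / 3.0
    return math.ceil(factor * p * (1 - p) / E ** 2)

# With no good guess available, p = 0.50 gives the largest sample size.
n_95 = sample_size_for_proportion(0.50, 0.03, confidence=95)
n_99 = sample_size_for_proportion(0.50, 0.03, confidence=99)
n_95_p20 = sample_size_for_proportion(0.20, 0.03, confidence=95)

print(n_95, n_99, n_95_p20)
```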


Cluster Sampling for Attributes 


Simple random sampling of discrete variables is often difficult or
impractical. In estimating plantation survival, for example, we could
select individual trees at random and examine them, but it wouldn't make
much sense to walk down a row of planted trees in order to observe a
single member of that row. It would usually be more reasonable to select
rows at random and observe all of the trees in the selected row.

Seed viability is often estimated by randomly selecting several lots of
100 or 200 seeds each and recording for each lot the percentage of the
seeds that germinate.

These are examples of cluster sampling; the unit of observation is the
cluster rather than the individual tree or single seed. The value attached
to the unit is the proportion having a certain characteristic rather than
the simple fact of having or not having that characteristic.

If the clusters are large enough (say over 100 individuals per cluster)
and nearly equal in size, the statistical methods that have been described
for measurement variables can often be applied. Thus, suppose that the
germination percent of a seedlot is estimated by selecting n = 10 sets of
200 seed each and observing the germination percent for each set.
If the results were

    Set                1     2     3     4     5     6     7     8     9    10  |   Sum
    Germination
    percent (p)     78.5  82.0  86.0  80.5  74.5  78.0  79.0  81.0  80.5  83.5  | 803.5

then the mean germination percent is estimated by

$$\bar{p} = \frac{\sum p}{n} = \frac{803.5}{10} = 80.35 \text{ percent}$$













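The standard deviation of p is

$$s_p = \sqrt{\frac{\sum p^2 - \frac{(\sum p)^2}{n}}{n - 1}} = \sqrt{\frac{(78.5)^2 + \dots + (83.5)^2 - \frac{(803.5)^2}{10}}{9}} = \sqrt{10.002778} = 3.163$$

And the standard error for $\bar{p}$ is

$$s_{\bar{p}} = \sqrt{\frac{s_p^2}{n}\left(1 - \frac{n}{N}\right)} = \sqrt{\frac{10.002778}{10}} = 1.000 \text{ (fpc ignored)}$$

Note that n and N in these equations refer to the number of clusters, not
to the number of individuals.
The 95-percent confidence interval, computed by the procedure for
continuous variables, is

$$\bar{p} \pm (t_{.05})(s_{\bar{p}}) = 80.35 \pm 2.262(1.000) = 78.1 \text{ to } 82.6 \quad (t \text{ has } (n - 1) = 9 \text{ df})$$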
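
Treating each cluster percentage as a single observation, the computation is the same as for any measurement variable. A minimal Python sketch (variable names are assumptions; the t value is the tabular figure for 9 degrees of freedom, entered by hand):

```python
import math
import statistics

# Germination percent for each of the 10 sets of 200 seed.
germination = [78.5, 82.0, 86.0, 80.5, 74.5, 78.0, 79.0, 81.0, 80.5, 83.5]

n = len(germination)                   # number of clusters, not of seeds
p_bar = statistics.mean(germination)
s_p = statistics.stdev(germination)    # divisor n - 1
se = s_p / math.sqrt(n)                # fpc ignored

t_05 = 2.262                           # tabular t, 9 df, .05 column
limits = (p_bar - t_05 * se, p_bar + t_05 * se)

print(round(p_bar, 2), round(s_p, 3), round(se, 3), [round(v, 1) for v in limits])
```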











Transformations 

The above method of computing confidence limits assumes that the
individual percentages follow something close to a normal distribution
with homogeneous variance (i.e., same variance regardless of the size of
the percent). If the clusters are small (say less than 100 individuals per
cluster) or some of the percentages are greater than 80 or less than 20,
the assumptions may not be valid and the computed confidence limits will
be unreliable.

In such cases it may be desirable to compute the transformation

$$y = \text{arc sine } \sqrt{\text{percent}}$$

and to analyze the transformed variable. The transformation is easily
made by means of table 6. Thus in the previous example we would have

    percent                78.5  82.0  86.0  80.5  74.5  78.0  79.0  81.0  80.5  83.5  |   Sum
    y = arc sine √percent  62.4  64.9  68.0  63.8  59.7  62.0  62.7  64.2  63.8  66.0  | 637.5

Then working with the transformed variables,

$$\bar{y} = \frac{637.5}{10} = 63.75, \text{ corresponding to a mean percentage of 80.4}$$

The variance of y is

$$s_y^2 = \frac{(62.4)^2 + \dots + (66.0)^2 - \frac{(637.5)^2}{10}}{9} = 5.227222$$

And the standard error of $\bar{y}$ is

$$s_{\bar{y}} = \sqrt{\frac{s_y^2}{n}} = \sqrt{\frac{5.227222}{10}} = 0.723$$

The 95-percent confidence interval on mean y is

$$\bar{y} \pm (t_{.05})(s_{\bar{y}}) = 63.75 \pm (2.262)(0.723) = 62.11 \text{ to } 65.39$$

These limits correspond to percentages of 78.1 to 82.7.
Because the clusters are fairly large and the value of p close to .50, the
transformation did not have much effect in this case.
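
Where table 6 is not at hand, the transformation and its inverse can be computed directly. The Python sketch below is an illustration; the function names are assumptions, and angles are in degrees to match the table. Because it does not round the transformed values to one decimal, its results differ slightly from the hand computation above.

```python
import math
import statistics

def arcsine_transform(percent):
    """Arc sine of the square root of the proportion, in degrees."""
    return math.degrees(math.asin(math.sqrt(percent / 100)))

def back_transform(angle_degrees):
    """Percentage corresponding to a transformed value."""
    return 100 * math.sin(math.radians(angle_degrees)) ** 2

germination = [78.5, 82.0, 86.0, 80.5, 74.5, 78.0, 79.0, 81.0, 80.5, 83.5]
y = [arcsine_transform(p) for p in germination]

y_bar = statistics.mean(y)
se_y = statistics.stdev(y) / math.sqrt(len(y))

t_05 = 2.262   # tabular t, 9 df, .05 column
lo_pct = back_transform(y_bar - t_05 * se_y)
hi_pct = back_transform(y_bar + t_05 * se_y)

print(round(y_bar, 2), round(back_transform(y_bar), 1), round(lo_pct, 1), round(hi_pct, 1))
```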


CHI-SQUARE TESTS 
Test of Independence 


Individuals are often classified according to two (or more) distinct
systems. A tree can be classified as to species and at the same time
according to whether it is or is not infected with some disease. A milacre
plot can be classified as to whether or not it is stocked with adequate
reproduction and whether it is shaded or not shaded. Given such a
cross-classification, it may be desirable to know whether the classification
of an individual according to one system is independent of its
classification by the other system. In the species-infection classification,
for example, independence of species and infection would be interpreted to
mean that there is no difference in infection rate between species (i.e.,
infection rate does not depend on species).

The hypothesis that two or more systems of classification are
independent can be tested by chi-square. The procedure can be illustrated by a
test of three termite repellents. A batch of 1,500 wooden stakes was
divided at random into three groups of 500 each, and each group received
a different termite-repellent treatment. The treated stakes were driven
into the ground, with the treatment at any particular stake location being
selected at random. Two years later the stakes were examined for termites.
The number of stakes in each classification is shown in the following 2 by
3 (two rows and three columns) contingency table:

                           Group   Group   Group
                             I       II     III    Subtotals

    Attacked by termites    193     148     210       551
    Not attacked            307     352     290       949

    Subtotals               500     500     500     1,500

If the data in the table be symbolized as shown below:

                             I       II     III    Subtotals

    Attacked               $a_1$   $a_2$   $a_3$       A
    Not attacked           $b_1$   $b_2$   $b_3$       B

    Subtotals              $T_1$   $T_2$   $T_3$       G

the test of independence is made by computing

$$\chi^2 = \frac{1}{(A)(B)}\sum_{i=1}^{3}\left(\frac{(a_iB - b_iA)^2}{T_i}\right)$$

$$= \frac{1}{(551)(949)}\left[\frac{((193)(949) - (307)(551))^2}{500} + \dots + \frac{((210)(949) - (290)(551))^2}{500}\right]$$

$$= 17.66$$



This result is compared to the tabular value of $\chi^2$ (table 4) with (c - 1)
degrees of freedom, where c is the number of columns in the table of data.
If the computed value exceeds the tabular value given in the 0.05 column,
the difference among treatments is said to be significant at the 0.05 level
(i.e., we reject the hypothesis that attack classification is independent of
treatment classification).

In this example, the computed value of 17.66 (2 degrees of freedom)
exceeds the tabular value in the 0.01 column, and so the difference in rate
of attack among treatments is significant at the 1-percent level.
Examination of the data suggests that this is primarily due to the lower
rate of attack on the Group II stakes.
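
The 2 by c computation is short enough to sketch in Python. The function name is an assumption, and the critical values are the standard tabular chi-square figures for 2 degrees of freedom, entered by hand rather than computed.

```python
def chi_square_2_by_c(row_a, row_b):
    """Chi-square test of independence for a 2-row contingency table.

    Implements (1 / (A*B)) * sum((a_i*B - b_i*A)^2 / T_i), where A and B are
    the row totals and T_i = a_i + b_i are the column totals.
    """
    A, B = sum(row_a), sum(row_b)
    total = sum((a * B - b * A) ** 2 / (a + b) for a, b in zip(row_a, row_b))
    return total / (A * B)

attacked = [193, 148, 210]
not_attacked = [307, 352, 290]

chi2 = chi_square_2_by_c(attacked, not_attacked)
df = len(attacked) - 1

CRITICAL_05 = 5.991   # tabular chi-square, 2 df, 0.05 column
CRITICAL_01 = 9.210   # tabular chi-square, 2 df, 0.01 column

attack_rates = [a / (a + b) for a, b in zip(attacked, not_attacked)]
print(round(chi2, 2), df, chi2 > CRITICAL_01, [round(r, 3) for r in attack_rates])
```

The attack rates (about 0.39, 0.30, and 0.42) make the lower rate for Group II easy to see.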

The r by c contingency table.—The above example is a simple case of the 
chi-square test of independence in an r by c table (ie, г rows and с 
columns). Thus, if a number of randomly selected forest stands were 
classified as to soil group and forest type the results might be as follows: 











Forest {уре 
Soil group | I I Ш | Subtotal 
1 27 48 62 137 
2 46 67 145 
3 26 51 61 138 





Subtotal 85 M5 1% 420, 





If the г by c table is represented in symbols: 











Forest type 
Soil group и II | Subtotal 
1 ди au au КА 
2 ап аз an $5 v 
3 ап ам ав 5: 
Subtotal T, T, Ts в 








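then the test of independence is

$$\chi^2 = \frac{1}{G} \sum_{i,j} \frac{(G a_{ij} - S_i T_j)^2}{S_i T_j}, \text{ with } (r-1)(c-1) \text{ degrees of freedom}$$

In this example

$$\chi^2_{4\,df} = \frac{1}{420} \left[ \frac{((420)(27) - (137)(85))^2}{(137)(85)} + \cdots + \frac{((420)(61) - (138)(190))^2}{(138)(190)} \right] = 1.03$$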





which is not significant at the 0.05 level. Thus, the test has failed to
demonstrate any real association between forest types and soil groups.
The test of independence can be extended to more than two classification
systems, but formulating meaningful hypotheses may be difficult.
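
As a rough check on the r by c arithmetic, the formula can be coded directly. This is a sketch under stated assumptions: the function name is hypothetical, and the count for soil group 2, forest type I (32) is the value implied by the row and column subtotals.

```python
def chi_square_r_by_c(table):
    """Chi-square test of independence: (1/G) * sum((G*a_ij - S_i*T_j)^2 / (S_i*T_j))."""
    row_totals = [sum(row) for row in table]          # S_i
    col_totals = [sum(col) for col in zip(*table)]    # T_j
    G = sum(row_totals)                               # grand total
    total = 0.0
    for i, row in enumerate(table):
        for j, a_ij in enumerate(row):
            st = row_totals[i] * col_totals[j]
            total += (G * a_ij - st) ** 2 / st
    df = (len(table) - 1) * (len(table[0]) - 1)
    return total / G, df


stands = [
    [27, 48, 62],   # soil group 1
    [32, 46, 67],   # soil group 2
    [26, 51, 61],   # soil group 3
]
chi2_rc, df_rc = chi_square_r_by_c(stands)
print(round(chi2_rc, 2), df_rc)
```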


Test of a Hypothesized Count 


A geneticist hypothesized that, if a certain cross were made, the progeny
would be of four types, in the proportions


A = 0.48, B = 0.32, C = 0.12, D = 0.08


The actual segregation of 1,225 progeny is shown below, along with the 
numbers expected according to the hypothesis. 








Type                  A       B       C       D      Total
Number (X_i)         542     401     164     118     1,225
Expected (m_i)       588     392     147      98     1,225














As the observed counts differ from those expected, we might wonder
if the hypothesis is false. Or, can departures as large as this occur strictly
by chance?


The chi-square test is

$$\chi^2 = \sum_{i=1}^{k} \frac{(X_i - m_i)^2}{m_i}, \text{ with } (k-1) \text{ degrees of freedom}$$

where:
   k = The number of groups recognized.
   X_i = The observed count for the i-th group.
   m_i = The count expected in the i-th group if the hypothesis is true.





For the above data

$$\chi^2_{3\,df} = \frac{(542-588)^2}{588} + \frac{(401-392)^2}{392} + \frac{(164-147)^2}{147} + \frac{(118-98)^2}{98} = 9.86$$

This value exceeds the tabular χ² with 3 degrees of freedom at the 0.05
level. Hence the hypothesis would be rejected (if the geneticist believed in
testing at the 0.05 level).
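
The test of a hypothesized count is easy to script. The sketch below assumes the type A count of 542 that is implied by the total of 1,225 progeny; the variable names are illustrative only. Carrying full precision gives a value slightly below the hand figure, which appears to have been summed from terms rounded to two decimals.

```python
proportions = [0.48, 0.32, 0.12, 0.08]     # hypothesized for types A, B, C, D
observed = [542, 401, 164, 118]            # actual segregation of progeny

total_progeny = sum(observed)
expected = [p * total_progeny for p in proportions]

# Sum of (observed - expected)^2 / expected over the k groups.
chi2_count = sum((x - m) ** 2 / m for x, m in zip(observed, expected))
df_count = len(observed) - 1               # (k - 1) degrees of freedom
print(round(chi2_count, 2), df_count)
```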


Bartlett's Test of Homogeneity of Variance 


Many of the statistical methods described later are valid only if the
variance is homogeneous. The t test of the following section assumes that
the variance is the same for each group, and so does the analysis of
variance. The fitting of an unweighted regression as described in the last
section also assumes that the dependent variable has the same degree of
variability (variance) for all levels of the independent variables.




Bartlett's test offers a means of evaluating this assumption. Suppose
that we have taken random samples in each of four groups and obtained
variances (s²) of 84.2, 63.8, 88.6, and 72.1 based on samples of 9, 21, 6,
and 11 units, respectively. We would like to know if these variances could
have come from populations all having the same variance. The quantities
needed for Bartlett's test are tabulated here:








          Variance            Corrected sum
Group       (s²)     (n−1)   of squares (SS)   1/(n−1)    log s²    (n−1)(log s²)
  1         84.2        8         673.6         0.125     1.92531     15.40248
  2         63.8       20       1,276.0         0.050     1.80482     36.09640
  3         88.6        5         443.0         0.200     1.94743      9.73715
  4         72.1       10         721.0         0.100     1.85794     18.57940
k = 4 groups   Sums    43       3,113.6         0.475                 79.81543


where: k = The number of groups (= 4).

       SS = The corrected sum of squares
          $= \sum X^2 - \frac{(\sum X)^2}{n} = (n-1)s^2$


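From this we compute the pooled within-group variance

$$\bar{s}^2 = \frac{\sum SS_i}{\sum (n_i - 1)} = \frac{3{,}113.6}{43} = 72.409$$

and

$$\log \bar{s}^2 = 1.85979$$

Then the test of homogeneity is:

$$\chi^2 = (2.3026)\left[ (\log \bar{s}^2)\left(\sum (n_i - 1)\right) - \sum (n_i - 1)(\log s_i^2) \right], \text{ with } (k-1) \text{ degrees of freedom}$$

In this case,

$$\chi^2_{3\,df} = (2.3026)[(1.85979)(43) - 79.81543] = 0.358$$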


This value of χ² is now compared with the value of χ² in table 4 for the
desired probability level. A value greater than that given in the table
would lead us to reject the homogeneity assumption.

The χ² value given by the above equation is biased upward. If χ² is
nonsignificant, the bias is not important. However, if the computed χ²
is just a little above the threshold value for significance, a correction for
bias should be applied. The uncorrected χ² is divided by the correction
factor

$$C = 1 + \frac{\left[ \sum \frac{1}{(n_i - 1)} \right] - \frac{1}{\sum (n_i - 1)}}{3(k-1)}$$





Note: The original form of the homogeneity test equation used natural
logarithms in place of the common logarithms shown here. The natural log of
any number is approximately 2.3026 times its common logarithm, hence the
constant of 2.3026 in the equation. In computations, common logarithms are
usually more convenient than natural logarithms.




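For the example, the correction factor is

$$C = 1 + \frac{0.475 - \frac{1}{43}}{3(4-1)} = 1.0502$$

The corrected value of χ² is then

$$\text{Corrected } \chi^2 = \frac{\text{Uncorrected } \chi^2}{C} = \frac{0.358}{1.0502} = 0.341$$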
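
The whole of Bartlett's computation, including the bias correction, can be sketched as follows. The function name and return layout are assumptions made for illustration; the arithmetic follows the common-logarithm form used above.

```python
import math


def bartlett_chi_square(variances, sample_sizes):
    """Return (uncorrected chi-square, correction factor, corrected chi-square)."""
    k = len(variances)
    dfs = [n - 1 for n in sample_sizes]                 # (n_i - 1)
    total_df = sum(dfs)
    # Pooled within-group variance: sum of SS divided by sum of df.
    pooled = sum(d * v for d, v in zip(dfs, variances)) / total_df
    weighted_logs = sum(d * math.log10(v) for d, v in zip(dfs, variances))
    chi2 = 2.3026 * (math.log10(pooled) * total_df - weighted_logs)
    correction = 1 + (sum(1 / d for d in dfs) - 1 / total_df) / (3 * (k - 1))
    return chi2, correction, chi2 / correction


chi2_b, c_b, corrected_b = bartlett_chi_square(
    [84.2, 63.8, 88.6, 72.1], [9, 21, 6, 11]
)
print(round(chi2_b, 3), round(c_b, 4), round(corrected_b, 3))
```

The result is referred to the chi-square table with (k − 1) = 3 degrees of freedom.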


COMPARING TWO GROUPS BY THE t TEST 


The t Test for Unpaired Plots 


An individual unit in a population may be characterized in a number of
different ways. A single tree, for example, can be described as alive or
dead, hardwood or softwood, infected or uninfected, and so forth. When
dealing with observations of this type we usually want to estimate the
proportion of a population having a certain attribute. Or, if there are two
or more different groups, we will often be interested in testing whether or
not the groups differ in the proportions of individuals having the specified
attribute. Some methods of handling these problems have been discussed
in previous sections.

Alternatively, we might describe a tree by a measurement of some
characteristic such as its diameter, height, or cubic volume. For this
measurement type of observation we may wish to estimate the mean for a
group as discussed in the section on sampling for measurement variables.
If there are two or more groups we will frequently want to test whether or
not the group means are different. Often the groups will represent types
of treatment which we wish to compare. Under certain conditions, the t or
F tests may be used for this purpose.

Both of these tests have a wide variety of applications. For the present, 
we will confine our attention to tests of the hypothesis that there is no 
difference between treatment (or group) means. The computational 
routine depends on how the observations have been selected or arranged. 
The first illustration of a t test of the hypothesis that there is no difference 
between the means of two treatments assumes that the treatments have 
been assigned to the experimental units completely at random. Except for 
the fact that there are usually (but not necessarily) an equal number of 
units or “plots” for each treatment, there is no restriction on the random 
assignment of treatments. 

In this example the “treatments” were two races of white pine which 
were to be compared on the basis of their volume production over a 
specified period of time. Twenty-two square one-acre plots were staked 
out for the study. Eleven of these were selected entirely at random and 
planted with seedlings of race A. The remaining eleven were planted with 
seedlings of race B. After the prescribed time period the pulpwood volume
(in cords) was determined for each plot. The results were as follows: 







        Race A                  Race B
    11     5     9          9     6     9
     8    10    11          9    13     8
    10     8    11          6     5     6
     8     8               10     7
   Sum = 99                Sum = 88
   Average = 9.0           Average = 8.0


To test the hypothesis that there is no difference between the race
means (sometimes referred to as a null hypothesis) we compute

$$t = \frac{\bar{X}_A - \bar{X}_B}{\sqrt{s^2 \left( \dfrac{n_A + n_B}{(n_A)(n_B)} \right)}}$$

where: $\bar{X}_A$ and $\bar{X}_B$ = The arithmetic means for groups A and B.
       $n_A$ and $n_B$ = The number of observations in groups A and B ($n_A$
                         and $n_B$ do not have to be the same).
       $s^2$ = The pooled within-group variance (calculation shown
               below).


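To compute the pooled within-group variance, we first get the corrected
sum of squares (SS) within each group.

$$SS_A = \sum X_A^2 - \frac{(\sum X_A)^2}{n_A} = 11^2 + 5^2 + \cdots + 8^2 - \frac{(99)^2}{11} = 34$$

$$SS_B = \sum X_B^2 - \frac{(\sum X_B)^2}{n_B} = 9^2 + 6^2 + \cdots + 7^2 - \frac{(88)^2}{11} = 54$$

Then the pooled variance is

$$s^2 = \frac{SS_A + SS_B}{(n_A - 1) + (n_B - 1)} = \frac{88}{20} = 4.4$$

Hence,

$$t = \frac{9.0 - 8.0}{\sqrt{4.4 \left( \dfrac{11 + 11}{(11)(11)} \right)}} = \frac{1.0}{\sqrt{0.800000}} = 1.118$$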


This value of t has (n_A − 1) + (n_B − 1) degrees of freedom. If it exceeds
the tabular value of t (table 2) at a specified probability level, we would
reject the hypothesis. The difference between the two means would be
considered significant (larger than would be expected by chance if there is
actually no difference).

In this case, tabular t with 20 degrees of freedom at the 0.05 level is
2.086. Since our sample value is less than this, the difference is not
significant at the 0.05 level.
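
For readers who prefer to check the unpaired t computation by machine, here is a minimal sketch. The function name is hypothetical, and the plot volumes are those read from the table above (sums of 99 and 88 cords).

```python
import math


def unpaired_t(group_a, group_b):
    """Pooled-variance t for two independent groups; returns (t, degrees of freedom)."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    # Corrected sums of squares within each group.
    ss_a = sum(x * x for x in group_a) - sum(group_a) ** 2 / n_a
    ss_b = sum(x * x for x in group_b) - sum(group_b) ** 2 / n_b
    df = (n_a - 1) + (n_b - 1)
    pooled_variance = (ss_a + ss_b) / df
    t = (mean_a - mean_b) / math.sqrt(pooled_variance * (n_a + n_b) / (n_a * n_b))
    return t, df


race_a = [11, 5, 9, 8, 10, 11, 10, 8, 11, 8, 8]
race_b = [9, 6, 9, 9, 13, 8, 6, 5, 6, 10, 7]

t_unpaired, df_unpaired = unpaired_t(race_a, race_b)
print(round(t_unpaired, 3), df_unpaired)
```

The computed t is then compared with the tabular value (2.086 for 20 degrees of freedom at the 0.05 level).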

Requirements.—One of the unfortunate aspects of the t test and other
statistical methods is that almost any kind of numbers can be plugged
into the equations. But if the numbers and methods of obtaining them do
not meet certain requirements, the result may be a fancy statistical facade
with nothing behind it. In a handbook of this scope it is not possible to
make the reader aware of all of the niceties of statistical usage, but a few
words of warning are certainly appropriate.







A fundamental requirement in the use of most statistical methods is
that the experimental material be a random sample of the population to
which the conclusions are to be applied. In the t test of white pine races,
the plots should be a sample of the sites on which the pines are to be
grown, and the planted seedlings should be a random sample representing
the particular race. A test conducted in one corner of an experimental
forest may yield conclusions that are valid only for that particular area or
sites that are about the same. Similarly, if the seedlings of a particular
race are the progeny of a small number of parents, their performance may
be representative of those parents only, rather than of the race.

In addition to assuming that the observations for a given race are a
valid sample of the population of possible observations, the t test described
above assumes that the population of such observations follows the normal
distribution. With only a few observations, it is usually impossible to
determine whether or not this assumption has been met. Special studies
can be made to check on the distribution, but often the question is left to
the judgment and knowledge of the research worker.

Finally, the t test of unpaired plots assumes that each group (or
treatment) has the same population variance. Since it is possible to compute
a sample variance for each group, this assumption can be checked with
Bartlett's test for homogeneity of variance. Most statistical textbooks
present variations of the t test that may be used if the group variances
are unequal.


Sample size 


If there is a real difference of D cords between the two races of white
pine, how many replicates (plots) would be needed to show that it is
significant? To answer this, we first assume that the number of replicates
will be the same for each group (n_A = n_B = n). The equation for t can
then be written

$$t = \frac{D}{\sqrt{\dfrac{2s^2}{n}}}, \quad \text{or} \quad n = \frac{2t^2 s^2}{D^2}$$


Next we need an estimate of the within-group variance, s². As usual, this
must be determined from previous experiments, or by special study of the
populations.

Example.—Suppose that we plan to test at the 0.05 level and wish to
detect a true difference of D = 1 cord if it exists. From previous tests we
estimate s² = 5.0. Thus we have

$$n = \frac{2t^2 s^2}{D^2} = 2t^2 \left( \frac{5.0}{1.0} \right)$$

Here we hit a snag. In order to estimate n we need a value for t, but the
value of t depends on the number of degrees of freedom, which depends on
n. The situation calls for an iterative solution, a fancy name for trial and
error. We start with a guessed value of n, say n_0 = 20. As t has (n_A − 1)
+ (n_B − 1) = 2(n − 1) degrees of freedom, we'll use t = 2.025 (= t_.05 with
38 df) and compute

$$n_1 = 2(2.025)^2 \left( \frac{5.0}{1.0} \right) = 41.0$$






The proper value of n will be somewhere between n_0 and n_1, much closer
to n_1 than to n_0. We can now make a second guess at n and repeat the
process. If we try n = 38, t will have 2(n − 1) = 74 df and t_.05 = 1.992.
Hence,

$$n_2 = 2(1.992)^2 \left( \frac{5.0}{1.0} \right) = 39.7$$

Thus, n appears to be over 39 and we will use n = 40 plots for each group
or a total of 80 plots.
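
The two trial-and-error steps can be reproduced with a small helper. This is a sketch only: the function name is hypothetical, and the t values (2.025 and 1.992) are simply the tabular figures quoted in the text rather than values looked up by the code.

```python
import math


def plots_per_group(t_value, variance, difference):
    """Replicates per group for an unpaired comparison: n = 2 * t^2 * s^2 / D^2."""
    return 2 * t_value ** 2 * variance / difference ** 2


s2 = 5.0   # estimated within-group variance
D = 1.0    # true difference (cords) we wish to detect

n1 = plots_per_group(2.025, s2, D)   # first trial, t for 38 df
n2 = plots_per_group(1.992, s2, D)   # second trial, t for 74 df
n_final = math.ceil(n2)              # round up to whole plots

print(round(n1, 1), round(n2, 1), n_final)
```

Each pass uses the t value for the degrees of freedom implied by the previous guess; the process stops when successive values of n agree closely.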


The t Test for Paired Plots 


A second test was made of the two races of white pine. It also had 11
replicates of each race, but instead of the two races being assigned
completely at random over the 22 plots, the plots were grouped into 11 pairs
and a different race was randomly assigned to each member of a pair.
The cordwood volumes at the end of the growth period were


Plot pair        1    2    3    4    5    6    7    8    9   10   11 |  Sum   Mean
Race A          12    8    8   11   10    9   11   11   13   10    7 |  110   10.0
Race B          10    7    8    9   11    6   10   11   10    8    9 |   99    9.0
d_i = A_i − B_i  2    1    0    2   −1    3    1    0    3    2   −2 |   11    1.0


As before, we wish to test the hypothesis that there is no real difference
between the race means.
The value of t when the plots have been paired is

$$t = \frac{\bar{X}_A - \bar{X}_B}{\sqrt{\dfrac{s_d^2}{n}}}, \text{ with } (n-1) \text{ degrees of freedom}$$

where: n = The number of pairs of plots.
       $s_d^2$ = The variance of the individual differences between A and B

$$s_d^2 = \frac{\sum d_i^2 - \dfrac{(\sum d_i)^2}{n}}{n-1} = \frac{2^2 + 1^2 + \cdots + (-2)^2 - \dfrac{11^2}{11}}{10} = 2.6$$

So, in this example we find

$$t = \frac{10.0 - 9.0}{\sqrt{\dfrac{2.6}{11}}} = 2.057$$

Comparing this to the tabular value of t (t_.05 with 10 df = 2.228), we
find that the difference is not significant at the 0.05 level. That is, a sample
mean difference of 1 cord or more could have occurred by chance more
than one time in twenty even if there is no real difference between the race
means. Usually such an outcome is not regarded as sufficiently strong
evidence to reject the hypothesis.





The paired test will be more sensitive (capable of detecting smaller real
differences) than the unpaired test whenever the experimental units
(plots in this case) can be grouped into pairs such that the variation
between pairs is appreciably larger than the variation within pairs. The
basis for pairing plots may be geographic proximity or similarity in any 
other characteristic that is expected to affect the performance of the plot. 
In animal-husbandry studies, litter mates are often paired, and where 
patches of human skin are the “plots,” the left and right arms may 
constitute the pair. If the experimental units are very homogeneous, there 
may be no advantage in pairing. 
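
A short script makes the paired computation concrete. It is an illustrative sketch, with a hypothetical function name, using the eleven plot pairs tabulated above.

```python
import math


def paired_t(group_a, group_b):
    """Paired t: mean difference divided by sqrt(s_d^2 / n); returns (t, df)."""
    diffs = [a - b for a, b in zip(group_a, group_b)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Variance of the individual differences between members of a pair.
    var_diff = (sum(d * d for d in diffs) - sum(diffs) ** 2 / n) / (n - 1)
    return mean_diff / math.sqrt(var_diff / n), n - 1


pair_a = [12, 8, 8, 11, 10, 9, 11, 11, 13, 10, 7]
pair_b = [10, 7, 8, 9, 11, 6, 10, 11, 10, 8, 9]

t_paired, df_paired = paired_t(pair_a, pair_b)
print(round(t_paired, 3), df_paired)
```

Note that the same mean difference of 1 cord gives a larger t here than in the unpaired example because the variance of the differences (2.6) is smaller than the pooled plot variance (4.4), although the paired test has only 10 degrees of freedom.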


Number of replicates 


The number (n) of plot pairs needed to detect a true mean difference of
size D is

$$n = \frac{t^2 s_d^2}{D^2}$$

N.B.: Be sure to use the variance of the difference ($s_d^2$) between paired
plots in this equation and not the variance among plots.


COMPARISON OF TWO OR MORE GROUPS BY 
ANALYSIS OF VARIANCE 


Complete Randomization 


A planter wanted to compare the effects of five site-preparation treat- 
ments on the early height growth of planted pine seedlings. He laid out 25 
plots, and applied each treatment to 5 randomly selected plots. The plots
were then hand-planted and at the end of 5 years the height of all pines 
was measured and an average height computed for each plot. The plot 
averages (in feet) were as follows: 


                         Treatments
                  A       B       C       D       E
                 15      16      13      11      14
                 14      14      12      13      12
                 12      13      11      10      12
                 13      15      12      12      10
                 13      14      10      11      11

Sums             67      72      58      57      59  |   313

Treatment
means          13.4    14.4    11.6    11.4    11.8  |  12.52


Looking at the data we see that there are differences among the
treatment means: A and B have higher averages than C, D, and E. Soils and
planting stock are seldom completely uniform, however, and so we would 
expect some differences even if every plot had been given exactly the same 
site-preparation treatment. The question is, can differences as large as this 




occur strictly by chance if there is actually no difference among
treatments? If we decide that the observed differences are larger than might
be expected to occur strictly by chance, the inference is that the treatment
means are not equal. Statistically speaking, we reject the hypothesis of no
difference among treatment means.
Problems like this are neatly handled by an analysis of variance. To
make this analysis, we need to fill in a table like the following:











Source of         Degrees of      Sums of        Mean
variation          freedom        squares       squares
Treatments            4
Error                20
Total                24








Source of variation.—There are a number of reasons why the height
of these 25 plots might vary, but only one can be definitely
identified and evaluated: that attributable to treatments. The unidentified
variation is assumed to represent the variation inherent in the experimental
material and is labeled error. Thus, total variation is being divided
into two parts: one part attributable to treatments, and the other
unidentified and called error.

Degrees of freedom.—Degrees of freedom are hard to explain in
nonstatistical language. In the simpler analyses of variance, however, they
are not difficult to determine. For the total, the degrees of freedom are one
less than the number of observations: there are 25 plots, so the total has
24 df's. For the sources other than error, the df's are one less than the
number of classes or groups recognized in the source. Thus, in the source
labeled treatments there are five groups (five treatments), so there will
be four degrees of freedom for treatments. The remaining degrees of
freedom (24 − 4 = 20) are associated with the error term.

Sums of squares.—There is a sum of squares associated with every
source of variation. These SS are easily calculated in the following steps:

First we need what is known as a "correction term" or C.T. This is
simply

$$C.T. = \frac{\left( \sum^{25} X \right)^2}{25} = \frac{313^2}{25} = 3{,}918.76$$

where: $\sum^{n}$ = the sum of n items.

Then the total sum of squares is

$$\text{Total } SS = \sum^{25} X^2 - C.T. = (15^2 + 14^2 + \cdots + 11^2) - C.T. = 64.24$$

The sum of squares attributable to treatments is

$$\text{Treatment } SS = \frac{\sum^{5} (\text{treatment totals}^2)}{\text{No. of plots per treatment}} - C.T.$$

$$= \frac{67^2 + 72^2 + \cdots + 59^2}{5} - C.T. = 34.64$$


Note that in both SS calculations, the number of items squared and added
was one more than the number of degrees of freedom associated with the
sum of squares. The number of degrees of freedom just below the SS and
the number of items to be squared and added just over the Σ provide a
partial check as to whether the proper totals are being used in the
calculation: the degrees of freedom must be one less than the number of items.

Note also that the divisor in the treatment SS calculation is equal to
the number of individual items that go to make up each of the totals being
squared in the numerator. This was also true in the calculation of total
SS, but there the divisor was 1 and hence did not have to be shown. Note
further that the divisor times the number over the summation sign
(5 × 5 = 25 for treatments) must always be equal to the total number of
observations in the test, which is another check.

The sum of squares for error is obtained by subtracting the treatment
SS from total SS. A good habit to get into when obtaining sums of
squares by subtraction is to perform the same subtraction using df's. In
the more complex designs, doing this provides a partial check on whether
the right items are being used.

Mean squares.—The mean squares are now calculated by dividing the
sums of squares by the associated degrees of freedom. It is not necessary
to calculate the mean square for the total.

The items that have been calculated are entered directly into the
analysis table, which at the present stage would look like this:























Source            df        SS        MS
Treatments         4      34.64      8.66
Error             20      29.60      1.48
Total             24      64.24

An F test of treatments is now made by dividing the MS for treatments
by the MS for error. In this case

$$F = \frac{8.66}{1.48} = 5.851$$

This figure is compared to the appropriate value of F in table 3 of the
appendix. Look across the top to the column headed 4 (corresponding to
the degrees of freedom for treatments). Follow down the column to the
row labeled 20 (corresponding to the degrees of freedom for error). The
tabular F for significance at the 0.05 level is 2.87, and that for the 0.01
level is 4.43. As the calculated value of F exceeds 4.43, we conclude that
the difference in height growth between treatments is significant at the
0.01 level. (More precisely, we reject the hypothesis that there is no
difference in mean height growth between the treatments.) If F had been
smaller than 4.43 but larger than 2.87, we would have said that the
difference is significant at the 0.05 level. If F had been less than 2.87, we
would have said that the difference between treatments is not significant
at the 0.05 level. The researcher should select his own level of significance
(preferably in advance of the study), keeping in mind that significance at
the α (alpha) level (for example) means this: if there is actually no
difference among treatments, the probability of getting chance differences as
large as those observed is α or less.
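
The steps of the completely randomized analysis (correction term, total SS, treatment SS, error SS by subtraction, mean squares, and F) can be sketched as below. The plot averages are those read from the treatment table above; the variable names are illustrative assumptions.

```python
# Plot averages (feet) for the five site-preparation treatments.
plots = {
    "A": [15, 14, 12, 13, 13],
    "B": [16, 14, 13, 15, 14],
    "C": [13, 12, 11, 12, 10],
    "D": [11, 13, 10, 12, 11],
    "E": [14, 12, 12, 10, 11],
}

all_values = [x for values in plots.values() for x in values]
n_obs = len(all_values)

correction = sum(all_values) ** 2 / n_obs                      # C.T.
total_ss = sum(x * x for x in all_values) - correction
treatment_ss = sum(sum(v) ** 2 / len(v) for v in plots.values()) - correction
error_ss = total_ss - treatment_ss                             # by subtraction

total_df = n_obs - 1
treatment_df = len(plots) - 1
error_df = total_df - treatment_df                             # same subtraction on df

treatment_ms = treatment_ss / treatment_df
error_ms = error_ss / error_df
f_ratio = treatment_ms / error_ms

print(round(treatment_ss, 2), round(error_ss, 2), round(f_ratio, 3))
```

The F ratio is compared with the tabular F for 4 and 20 degrees of freedom.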




The t test versus the analysis of variance.—If only two treatments are
being compared, the analysis of variance of a completely randomized
design and the t test of unpaired plots lead to the same conclusion. The
choice of test is strictly one of personal preference, as may be verified by
applying the analysis of variance to the data used to illustrate the t test of
unpaired plots. The resulting F value will be equal to the square of the
value of t that was obtained (i.e., F = t²).

Like the t test, the F test is valid only if the variable observed is 
normally distributed and if all groups have the same variance. 


Multiple Comparisons 


In the example illustrating the completely randomized design, the 
difference among treatments was found to be significant at the 0.01 
probability level. This is interesting as far as it goes, but usually we will 
want to take a closer look at the data, making comparisons among various 
combinations of the treatments. 

Suppose, for example, that A and B involved some mechanical form of 
site preparation while C, D, and E were chemical treatments. Then we 
might want to test whether the average of A and B together differed from 
the combined average of C, D, and E. Or, we might wish to test whether 
A and B differ significantly from each other. When the number of
replications (n) is the same for all treatments, such comparisons are fairly
easy to define and test.

The question of whether the average of treatments A and B differs
significantly from the average of treatments C, D, and E is equivalent to
testing whether the linear contrast

$$\hat{Q} = (3\bar{A} + 3\bar{B}) - (2\bar{C} + 2\bar{D} + 2\bar{E})$$

differs significantly from zero ($\bar{A}$ = the mean for treatment A, etc.). Note
that the coefficients of this contrast sum to zero (3 + 3 − 2 − 2 − 2 = 0) and
are selected so as to put the two means in the first group on an equal basis
with the three means in the second group.

Similarly, testing whether treatment A differs significantly from
treatment B is the same as testing whether the contrast $\hat{Q} = \bar{A} - \bar{B}$ differs
significantly from zero.


F Test with Single Degree of Freedom 


A comparison specified in advance of the study (on logical grounds and
before examination of the data) can be tested by an F test with single
degree of freedom. For the linear contrast

$$\hat{Q} = a_1\bar{X}_1 + a_2\bar{X}_2 + a_3\bar{X}_3 + \cdots$$

among means based on the same number (n) of observations, the sum of
squares has one degree of freedom and is computed as

$$SS_{1\,df} = \frac{n\hat{Q}^2}{\sum a_i^2}$$

This sum of squares divided by the mean square for error provides an F
test of the comparison.
Thus, in testing A and B versus C, D, and E we have

$$\hat{Q} = 3(13.4) + 3(14.4) - 2(11.6) - 2(11.4) - 2(11.8) = 13.8$$

and

$$SS_{1\,df} = \frac{5(13.8)^2}{3^2 + 3^2 + (-2)^2 + (-2)^2 + (-2)^2} = \frac{952.20}{30} = 31.74$$

Then dividing by the error mean square gives the F value for testing the
contrast.

$$F = \frac{31.74}{1.48} = 21.446, \text{ with 1 and 20 degrees of freedom}$$

This exceeds the tabular value of F (4.35) at the 0.05 probability level. If
this is the level at which we decided to test, we would reject the hypothesis
that the mean of treatments A and B does not differ from the mean of
treatments C, D, and E.

If Q is expressed in terms of the treatment totals rather than their
means so that

$$\hat{Q}_T = a_1\left(\sum X_1\right) + a_2\left(\sum X_2\right) + \cdots$$

then the equation for the single degree of freedom sum of squares is

$$SS_{1\,df} = \frac{\hat{Q}_T^2}{n \sum a_i^2}$$

The results will be the same as those obtained with the means. For the
test of A and B versus C, D, and E,

$$\hat{Q}_T = 3(67) + 3(72) - 2(58) - 2(57) - 2(59) = 69$$

And,

$$SS_{1\,df} = \frac{69^2}{5[3^2 + 3^2 + (-2)^2 + (-2)^2 + (-2)^2]} = \frac{4{,}761}{150} = 31.74, \text{ as before.}$$

Working with the totals saves the labor of computing means and avoids
possible rounding errors.
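
Both routes to the single-degree-of-freedom sum of squares (from means and from totals) can be checked with a few lines. The helper names below are hypothetical; the error mean square of 1.48 is taken from the analysis of variance table.

```python
def contrast_ss_from_means(coefs, means, n):
    """SS (1 df) = n * Q^2 / sum(a_i^2), with Q = sum(a_i * mean_i)."""
    q = sum(a * m for a, m in zip(coefs, means))
    return n * q ** 2 / sum(a * a for a in coefs)


def contrast_ss_from_totals(coefs, totals, n):
    """SS (1 df) = Q_T^2 / (n * sum(a_i^2)), with Q_T = sum(a_i * total_i)."""
    q_t = sum(a * t for a, t in zip(coefs, totals))
    return q_t ** 2 / (n * sum(a * a for a in coefs))


coefs = [3, 3, -2, -2, -2]                 # A and B versus C, D, and E
means = [13.4, 14.4, 11.6, 11.4, 11.8]
totals = [67, 72, 58, 57, 59]
error_mean_square = 1.48

ss_means = contrast_ss_from_means(coefs, means, n=5)
ss_totals = contrast_ss_from_totals(coefs, totals, n=5)
f_contrast = ss_totals / error_mean_square     # 1 and 20 degrees of freedom

print(round(ss_means, 2), round(ss_totals, 2), round(f_contrast, 3))
```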


Scheffé's Test 


Quite often we will want to test comparisons that were not anticipated
before the data were collected. If the test of treatments was significant,
such unplanned comparisons can be tested by the method of Scheffé.
When there are n replications of each treatment, k degrees of freedom for
treatment, and v degrees of freedom for error, any linear contrast among
the treatment means

$$\hat{Q} = a_1\bar{X}_1 + a_2\bar{X}_2 + \cdots$$

is tested by computing

$$F = \frac{n\hat{Q}^2}{k\left(\sum a_i^2\right)(\text{Error mean square})}$$

This value is then compared to the tabular value of F with k and v degrees
of freedom.


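For example, to test treatment B against the means of treatments C
and E we would have

$$\hat{Q} = [2\bar{B} - (\bar{C} + \bar{E})] = [2(14.4) - 11.6 - 11.8] = 5.4$$

and

$$F = \frac{5(5.4)^2}{(4)[2^2 + (-1)^2 + (-1)^2](1.48)} = \frac{145.80}{35.52} = 4.105, \text{ with 4 and 20 degrees of freedom}$$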




This figure is larger than the tabular value of F (= 2.87), and so in testing
at the 0.05 level we would reject the hypothesis that the mean for
treatment B did not differ from the combined average of treatments C and E.

For a contrast ($\hat{Q}_T$) expressed in terms of treatment totals, the equation
for F becomes

$$F = \frac{\hat{Q}_T^2}{nk\left(\sum a_i^2\right)(\text{Error mean square})}$$
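
Scheffé's F for an unplanned contrast differs from the single-degree-of-freedom F only in the extra divisor k. A minimal sketch, with a hypothetical function name, applied to treatment B versus C and E:

```python
def scheffe_f(coefs, means, n, k, error_mean_square):
    """Scheffe's F = n * Q^2 / (k * sum(a_i^2) * error mean square)."""
    q = sum(a * m for a, m in zip(coefs, means))
    return n * q ** 2 / (k * sum(a * a for a in coefs) * error_mean_square)


# Treatment B against the average of C and E: Q = 2B - C - E.
scheffe_coefs = [2, -1, -1]
scheffe_means = [14.4, 11.6, 11.8]

f_scheffe = scheffe_f(scheffe_coefs, scheffe_means, n=5, k=4, error_mean_square=1.48)
print(round(f_scheffe, 3))
```

The result is compared with the tabular F for k = 4 and v = 20 degrees of freedom (2.87 at the 0.05 level).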


Unequal Replication 


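If the number of replications is not the same for all treatments, then
for the linear contrast

$$\hat{Q} = a_1\bar{X}_1 + a_2\bar{X}_2 + \cdots$$

the sum of squares in the single degree of freedom F test is given by

$$SS_{1\,df} = \frac{\hat{Q}^2}{\left( \dfrac{a_1^2}{n_1} + \dfrac{a_2^2}{n_2} + \cdots \right)}$$

where: $n_i$ = the number of replications on which $\bar{X}_i$ is based.

With unequal replication, the F value in Scheffé's test is computed by
the equation

$$F = \frac{\hat{Q}^2}{(k)\left( \dfrac{a_1^2}{n_1} + \dfrac{a_2^2}{n_2} + \cdots \right)(\text{Error mean square})}$$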



Selecting the coefficients ($a_i$) for such contrasts can be tricky. When
testing the hypothesis that there is no difference between the means of two
groups of treatments, the positive coefficients are usually

$$\text{positive } a_i = \frac{n_i}{p}$$

where p = the total number of plots in the group of treatments with
positive coefficients.
The negative coefficients are

$$\text{negative } a_i = -\frac{n_i}{m}$$

where m = the total number of plots in the group of treatments with
negative coefficients.

To illustrate, if we wish to compare the mean of treatments A, B, and C
with the mean of treatments D and E and there are two plots of treatment
A, three of B, five of C, three of D, and two of E, then p = 2 + 3 + 5 = 10,
m = 3 + 2 = 5 and the contrast would be

$$\hat{Q} = \left( \frac{2}{10}\bar{A} + \frac{3}{10}\bar{B} + \frac{5}{10}\bar{C} \right) - \left( \frac{3}{5}\bar{D} + \frac{2}{5}\bar{E} \right)$$
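
The coefficient rule for unequal replication can be illustrated with exact fractions. The function and argument names below are hypothetical; the plot counts are those of the illustration above.

```python
from fractions import Fraction


def contrast_coefficients(plots_per_treatment, positive_group, negative_group):
    """Coefficients n_i/p for the positive group and -n_i/m for the negative group."""
    p = sum(plots_per_treatment[t] for t in positive_group)
    m = sum(plots_per_treatment[t] for t in negative_group)
    coefs = {t: Fraction(plots_per_treatment[t], p) for t in positive_group}
    coefs.update({t: -Fraction(plots_per_treatment[t], m) for t in negative_group})
    return coefs


replication = {"A": 2, "B": 3, "C": 5, "D": 3, "E": 2}
unequal_coefs = contrast_coefficients(replication, ["A", "B", "C"], ["D", "E"])

for treatment, a_i in unequal_coefs.items():
    print(treatment, a_i)
```

Fractions are reduced automatically, so 2/10 prints as 1/5 and 5/10 as 1/2; the coefficients still sum to zero, as a contrast requires.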


Randomized Block Design


In the completely randomized design the error mean square is a measure
of the variation among plots treated alike. It is in fact an average of the
within-treatment variances, as may easily be verified by computation. If
there is considerable variation among plots treated alike, the error mean
square will be large and the F test for a given set of treatments is less
likely to be significant. Only large differences among treatments will be
detected as real and the experiment is said to be insensitive.

Often the error can be reduced (thus giving a more sensitive test) by
use of a randomized block design in place of complete randomization. In
this design, similar plots or plots that are close together are grouped into
blocks. Usually the number of plots in each block is the same as the
number of treatments to be compared, though there are variations having
two or more plots per treatment in each block. The blocks are recognized
as a source of variation that is isolated in the analysis.

As an example, a randomized block design with five blocks was used to
test the height growth of cottonwood cuttings from four selected parent
trees. The field layout looked like this:





[Field layout diagram: five blocks of four plots each; not reproduced here.]


Each plot consisted of a planting of 100 cuttings of the clone assigned to
that plot. When the trees were 5 years old the heights of all survivors
were measured and an average computed for each plot.


The plot averages (in feet) by clones and blocks are summarized below: 


                     Clone                 Block
Block        A      B      C      D       totals
  I         18     14     12     16         60
  II        15     15     16     13         59
  III       16     15      8     15         54
  IV        14     12     10     12         48
  V         12     14      9     14         49

Clone
totals      75     70     55     70        270

Clone
means       15     14     11     14








The hypothesis to be tested is that clones do not differ in mean height. 

In this design there are two identifiable sources of variation—that 
attributable to clones and that associated with blocks. The remaining 
portion of the total variation is used as a measure of experimental error. 
The outline of the analysis is therefore as follows: 
















Source of                   Sums of        Mean
variation         df        squares       squares
Blocks             4
Clones             3
Error             12
Total             19














The breakdown in degrees of freedom and computation of the various
sums of squares follow the same pattern as in the completely randomized
design. Total degrees of freedom (19) are one less than the total number
of plots. Degrees of freedom for clones (3) are one less than the number
of clones. With five blocks, there will be four degrees of freedom for
blocks. The remaining 12 degrees of freedom are associated with the
error term.

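Sums-of-squares calculations proceed as follows:

1. The correction term

$$C.T. = \frac{\left( \sum^{20} X \right)^2}{20} = \frac{270^2}{20} = 3{,}645$$

2. $$\text{Total } SS = \sum^{20} X^2 - C.T. = (18^2 + 14^2 + \cdots + 14^2) - C.T. = 3{,}766 - 3{,}645 = 121$$

3. $$\text{Clone } SS = \frac{\sum^{4} (\text{Clone totals}^2)}{\text{No. of plots per clone}} - C.T. = \frac{75^2 + 70^2 + 55^2 + 70^2}{5} - C.T. = 3{,}690 - 3{,}645 = 45$$

4. $$\text{Block } SS = \frac{\sum^{5} (\text{Block totals}^2)}{\text{No. of plots per block}} - C.T. = \frac{60^2 + 59^2 + \cdots + 49^2}{4} - C.T. = 3{,}675.5 - 3{,}645 = 30.5$$

5. Error SS = Total SS − Clone SS − Block SS = 45.5
    (12 df)     (19 df)     (3 df)      (4 df)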


Note that in obtaining the error SS by subtraction, we get a partial check
on ourselves by subtracting clone and block df's from the total df to see if
we come out with the correct number for error df. If these don't check,
we have probably used the wrong sums of squares in the subtraction.
Mean squares are again calculated by dividing the sums of squares by
the associated number of degrees of freedom.
Tabulating the results of these computations
















Source            df         SS          MS
Blocks             4        30.5        7.625
Clones             3        45.0       15.000
Error             12        45.5        3.792
Total             19       121.0





F for clones is obtained by dividing clone MS by error MS. In this case
F = 15.000/3.792 = 3.956. As this is larger than the tabular F of 3.49 (F_.05
with 3 and 12 degrees of freedom) we conclude that the difference between
clones is significant at the 0.05 level. The significance appears to be due
largely to the low value of C as compared to A, B, and D.
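
The randomized block analysis can be reproduced from the marginal totals alone. The sketch below assumes only the figures quoted in the text: the clone totals, the block totals, and the sum of squared plot values (3,766). Variable names are illustrative.

```python
clone_totals = [75, 70, 55, 70]          # clones A, B, C, D
block_totals = [60, 59, 54, 48, 49]      # blocks I through V
sum_of_squared_plots = 3766              # sum of X^2 over all 20 plots

n_clones = len(clone_totals)
n_blocks = len(block_totals)
n_plots = n_clones * n_blocks

rb_correction = sum(clone_totals) ** 2 / n_plots
rb_total_ss = sum_of_squared_plots - rb_correction
# Each clone total is made up of one plot per block, and vice versa.
clone_ss = sum(t * t for t in clone_totals) / n_blocks - rb_correction
block_ss = sum(t * t for t in block_totals) / n_clones - rb_correction
rb_error_ss = rb_total_ss - clone_ss - block_ss

clone_df = n_clones - 1
block_df = n_blocks - 1
rb_error_df = (n_plots - 1) - clone_df - block_df

f_clones = (clone_ss / clone_df) / (rb_error_ss / rb_error_df)
print(rb_total_ss, clone_ss, block_ss, rb_error_ss, round(f_clones, 3))
```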

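Comparisons among clone means can be made by the methods previously
described. For example, to test the prespecified (i.e., before examining
the data) hypothesis that there is no difference between the mean of
clone C and the combined average of A, B, and D we would have:

$$SS \text{ for } (A+B+D \text{ vs. } C) = \frac{5(3\bar{C} - \bar{A} - \bar{B} - \bar{D})^2}{3^2 + (-1)^2 + (-1)^2 + (-1)^2} = \frac{5(-10)^2}{12} = 41.667$$

Then,

$$F = \frac{41.667}{3.792} = 10.99, \text{ with 1 and 12 degrees of freedom}$$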








N.B.: With only two treatments, the analysis of variance of a randomized
block design is equivalent to the t test of paired replicates. The value
of F will be equal to the value of t² and the inferences derived from the
tests will be the same. The choice of tests is a matter of personal preference.


Latin Square Design 


In the randomized block design the purpose of blocking is to isolate a
recognizable extraneous source of variation. If successful, blocking
reduces the error mean square and hence gives a more sensitive test than
could be obtained by complete randomization.

In some situations, however, we have a two-way source of variation that
cannot be isolated by blocks alone. In a field, for example, fertility
gradients may exist both parallel to and at right angles to plowed rows.
Simple blocking isolates only one of these sources of variation, leaving the
other to swell the error term and reduce the sensitivity of the test.

When such a two-way source of extraneous variation is recognized or 
suspected, the Latin square design may be helpful. In this design, the total 
number of plots or experimental units is made equal to the square of the 
number of treatments. In forestry and agricultural experiments, the 
plots are often (but not always) arranged in rows and columns with each 
row and column having a number of plots equal to the number of treat- 
ments being tested. The rows represent different levels of one source of 
extraneous variation while the columns represent different levels of the 
other source of extraneous variation. Thus, before the assignment of 
treatments, the field layout of a Latin square for testing five treatments 
might look like this: 


[Figure: blank field layout of 25 plots arranged in five rows and five columns.]


Treatments are assigned to plots at random, but with the very
important restriction that a given treatment cannot appear more than once
in any row or any column.

An example of a field layout of a Latin square for testing five treat- 
ments is given below. The letters represent the assignment of five 
treatments (which here are five species of hardwoods). The numbers 
show the average 5-year height growth by plots. The tabulation shows the 
totals for rows, columns, and treatments. 


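[Figure: field layout of the Latin square in five rows and five columns, showing the species letter (A to E) and the average 5-year height growth for each plot.]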





Row, column, and treatment totals

 Row      Σ      Column     Σ      Treatment     Σ      Mean
  1      ...       1        83         A         95      19
  2      ...       2        85         B         80      16
  3       80       3        89         C         75      15
  4      ...       4        81         D         85      17
  5      ...       5        77         E         80      16

 Total   415     Total     415       Total      415     16.6


The partitioning of df's, the calculation of sums of squares, and the 
subsequent analysis follow much the same pattern illustrated previously 
for randomized blocks. 


C.T. = (ΣX)²/n = 415²/25 = 172,225/25 = 6,889

Total SS = ΣX² − C.T. = 7,041 − C.T. = 152.0

Row SS = Σ(Row totals²)/(No. of plots per row) − C.T.
       = 34,529/5 − C.T. = 16.8

Col. SS = Σ(Column totals²)/(No. of plots per column) − C.T.
        = 34,525/5 − C.T. = 16.0

Species SS = Σ(Species totals²)/(No. of plots per species) − C.T.
           = 34,675/5 − C.T. = 46.0

Error SS = Total SS − Species SS − Row SS − Col. SS = 73.2
 (12 df)   (24 df)     (4 df)      (4 df)   (4 df)




Analysis of variance 














Source          df        SS        MS
Rows             4      16.8       4.2
Columns          4      16.0       4.0
Species          4      46.0      11.5
Error           12      73.2       6.1
Total           24     152.0

F (for species) = 11.5/6.1 = 1.885





As the computed value of F is less than the tabular value of F at the 0.05
level (with 4/12 df's) the differences among species are considered
nonsignificant.
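
The Latin square arithmetic can be retraced as follows. This is a minimal Python sketch rather than a general routine: the helper `ss_from_totals` is a hypothetical name, the sum of squared row totals (34,529) and the sum of squared plot values (7,041) are taken directly from the text, and the column 3 total of 89 is inferred from the grand total of 415.

```python
def ss_from_totals(totals, plots_per_total, correction_term):
    """Sum of squares for a classification, computed from its totals."""
    return sum(t * t for t in totals) / plots_per_total - correction_term


n_plots = 25
correction_term = 415 ** 2 / n_plots        # C.T. = 6,889

column_totals = [83, 85, 89, 81, 77]        # column 3 inferred from the grand total
species_totals = [95, 80, 75, 85, 80]       # species A to E

total_ss = 7041 - correction_term           # sum of squared plot values, from the text
row_ss = 34529 / 5 - correction_term        # sum of squared row totals, from the text
col_ss = ss_from_totals(column_totals, 5, correction_term)
species_ss = ss_from_totals(species_totals, 5, correction_term)

# Error is what remains after rows, columns, and species are removed.
error_ss = total_ss - row_ss - col_ss - species_ss
error_df = 24 - 4 - 4 - 4

f_species = (species_ss / 4) / (error_ss / error_df)

print(round(row_ss, 1), round(col_ss, 1), round(species_ss, 1),
      round(error_ss, 1), round(f_species, 3))
```

Computing the error term by subtraction mirrors the hand method; it also means any slip in one of the other sums of squares shows up as an implausible error value.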

The Latin square design can be used whenever there is a two-way
heterogeneity that cannot be controlled simply by blocking. In greenhouse
studies, distance from a window could be treated as a row effect while
distance from the blower or heater might be regarded as a column effect.
Though the plots are often physically arranged in rows or columns, this
is not required. In testing the use of materials in a manufacturing process
where different machines and machine operators will be involved, the
variation between machines could be treated as a row effect and the
variation due to operator as a column effect.

The Latin square should not be used if an interaction between rows and 
treatments or columns and treatments is suspected. 


Factorial Experiments 


In a comparison of corn yields following three rates or levels of nitrogen 
fertilization it was found that the yields depended on how much phos- 
phorus was used along with the nitrogen. The differences in yield were 
smaller when no phosphorus was used than when the nitrogen applications
were accompanied by 100 pounds per acre of phosphorus. In statistics 
this situation is referred to as an interaction between nitrogen and phos- 
phorus. Another example: when leaf litter was removed from the forest 
floor, the catch of pine seedlings was much greater than when the litter 
was not removed; but for red oak the reverse was true—the seedling catch 
was lower where litter was removed. Thus, species and litter treatment 
were interacting. 

Interactions are important in the interpretation of study results. In 
the presence of an interaction between species and litter treatment it 
obviously makes no sense to talk about the effects of litter removal with- 
out specifying the species. The nitrogen-phosphorus interaction means 
that it may be misleading to recommend a level of nitrogen without 
mentioning the associated level of phosphorus. 

Factorial experiments are aimed at evaluating known or suspected
interactions. In these experiments, each factor to be studied is tested at
several levels and each level of a factor is tested at all possible
combinations of the levels of the other factors. In a planting test involving
three species of trees and four methods of preplanting site preparation, each
method will be applied to each species, and the total number of treatment
combinations will be 12. In a factorial test of the effects of two nursery
treatments on the survival of four species of pine planted by three
different methods, there would be 24 (2 × 4 × 3 = 24) treatment combinations.

The method of analysis can be illustrated by a factorial test of the
effects of three levels of nitrogen fertilization (0, 100, and 200 pounds per
acre) on the growth of three species (A, B, and C) of planted pine. The
nine possible treatment combinations were assigned at random to nine
plots in each of three blocks. Treatments were evaluated on the basis of
average annual height growth in inches over a 3-year period.
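Field layout and plot data were as follows (with subscripts denoting
nitrogen levels: 0 = 0, 1 = 100, 2 = 200):

[Figure: field layout of Blocks I, II, and III, showing the species letter, nitrogen-level subscript, and height growth for each of the nine plots in a block.]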





The preliminary analysis of the nine combinations (temporarily ignor-
ing their factorial nature) is made just as though this were a straight
randomized block design (which is exactly what it is). (See the summary
of plot data.)


Sums of squares

C.T. = (ΣX)²/27 = 683²/27 = 17,277.3704

Total SS = ΣX² − C.T. = (45² + 17² + ... + 21²) − C.T. = 2,275.6296

Block SS = Σ(Block totals²)/(No. of plots per block) − C.T. = 11.6296































Summary of plot data

             Nitrogen          Blocks           Nitrogen    Species
Species       level        I     II    III     subtotals    totals
A               0         45     40     37        122
                1         17     18     28         63
                2         24     14     17         55
   Block subtotals        86     72     82                    240
B               0         28     31     29         88
                1         23     16     19         58
                2         12     21     23         56
   Block subtotals        63     68     71                    202
C               0         37     43     39        119
                1         20     25     19         64
                2         17     20     21         58
   Block subtotals        74     88     79                    241
All species     0        110    114    105        329
                1         60     59     66        185
                2         53     55     61        169
   Totals                223    228    232        683 (grand total)








Treatment SS = Σ(Treatment totals²)/(No. of plots per treatment) − C.T.
             = (122² + 63² + ... + 58²)/3 − C.T. = 1,970.2963

Error SS = Total SS − Treatment SS − Block SS = 293.7037











Tabulating these in the usual form:

Source          df           SS           MS
Blocks           2      11.6296       5.8148
Treatments       8   1,970.2963     246.2870
Error           16     293.7037      18.3565
Total           26   2,275.6296

Testing treatments: F (8/16 df) = 246.2870/18.3565 = 13.417, significant at
the 0.01 level.

The next step is to analyze the components of the treatment variability.
How do the species compare? What is the effect of fertilization? And
does fertilization affect all species the same way (i.e., is there a species-
nitrogen interaction)? To answer these questions we have to partition
the degrees of freedom and sums of squares associated with treatments.
This is easily done by summarizing the data for the nine combinations in
a two-way table.


                  Nitrogen levels
Species        0        1        2      Totals
A            122       63       55        240
B             88       58       56        202
C            119       64       58        241
Totals       329      185      169        683








The nine individual values will be recognized as those that entered into 
the calculation of the treatment SS. Keeping in mind that each entry in 
the body of the table is the sum of three plot values, and that the species
and nitrogen totals are each the sum of 9 plots, the sums of squares for 
species, nitrogen, and the species-nitrogen interaction can be computed 
as follows: 


Treatment SS = 1,970.2963 (as previously calculated)

Species SS = Σ(Species totals²)/(No. of plots per species) − C.T.
           = (240² + 202² + 241²)/9 − C.T.
           = 156,485/9 − C.T. = 109.8518

Nitrogen SS = Σ(Nitrogen totals²)/(No. of plots per level of nitrogen) − C.T.
            = (329² + 185² + 169²)/9 − C.T.
            = 171,027/9 − C.T. = 1,725.6296

Species-nitrogen interaction SS = Treatment SS − Species SS − Nitrogen SS
                                = 134.8149














The analysis now becomes:

Source                              df            SS           MS          F
Blocks                               2       11.6296       5.8148
Treatments                           8    1,970.2963     246.2870     13.417**
   Species ¹                         2      109.8518      54.9260      2.992NS
   Nitrogen ¹                        2    1,725.6296     862.8148     47.003**
   Species-nitrogen interaction ¹    4      134.8149      33.7037      1.836NS
Error                               16      293.7037      18.3565
Total                               26    2,275.6296

¹ Offset figures are a partitioning of the df's and sum of squares for
Treatments, and are therefore not included in the totals at the bottom of
the table.







The degrees of freedom for simple interactions can be obtained in two
ways. The first way is by subtracting the df's associated with the
component factors (in this case two for species and two for nitrogen levels)
from the df's associated with all possible treatment combinations (eight in
this case). The second way is to calculate the interaction df's as the
product of the component factor df's (in this case 2 × 2 = 4). Do it
both ways as a check.

The F values for species, nitrogen, and the species-nitrogen interaction 
are calculated by dividing their mean squares by the mean square for 
error. In the above tabulation, last column, NS indicates nonsignificant 
and ** means significant at the 0.01 level. 

The analysis indicates a significant difference among levels of nitrogen, 
but no difference between species and no species-nitrogen interaction. 
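
The partition of the treatment sum of squares can be reproduced from the nine treatment totals. This is a minimal Python sketch, assuming the species-by-nitrogen totals as read from the summary of plot data (each is the sum of three plots); the error mean square of 18.3565 is taken from the text, and all variable names are illustrative.

```python
# Treatment totals: species -> totals at nitrogen levels 0, 1, 2.
totals = {
    "A": [122, 63, 55],
    "B": [88, 58, 56],
    "C": [119, 64, 58],
}
n_plots = 27
grand_total = sum(sum(row) for row in totals.values())
correction_term = grand_total ** 2 / n_plots


def ss_from_totals(values, plots_per_total):
    """Sum of squares from a set of totals, each based on the same number of plots."""
    return sum(v * v for v in values) / plots_per_total - correction_term


cell_totals = [v for row in totals.values() for v in row]
species_totals = [sum(row) for row in totals.values()]
nitrogen_totals = [sum(col) for col in zip(*totals.values())]

treatment_ss = ss_from_totals(cell_totals, 3)
species_ss = ss_from_totals(species_totals, 9)
nitrogen_ss = ss_from_totals(nitrogen_totals, 9)
interaction_ss = treatment_ss - species_ss - nitrogen_ss

# Interaction df obtained both ways, as a check.
df_by_subtraction = 8 - 2 - 2
df_by_product = (3 - 1) * (3 - 1)

error_ms = 18.3565  # from the analysis of variance in the text
f_species = (species_ss / 2) / error_ms
f_nitrogen = (nitrogen_ss / 2) / error_ms
f_interaction = (interaction_ss / df_by_product) / error_ms

print(round(species_ss, 4), round(nitrogen_ss, 4), round(interaction_ss, 4))
print(round(f_species, 3), round(f_nitrogen, 3), round(f_interaction, 3))
```

Because the code carries full precision, the interaction sum of squares may differ from the hand-computed figure in the last decimal place.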


As before, a prespecified comparison among treatment means can be 
tested by breaking out the sum of squares associated with that compari- 
son. To illustrate the computations, we will test nitrogen versus no 
nitrogen and also 100 pounds versus 200 pounds of nitrogen. 


Nitrogen vs. no nitrogen SS = 9[2(N̄₀) − N̄₁ − N̄₂]² / (2² + 1² + 1²)
                            = 9[2(36.5556) − 20.5556 − 18.7778]² / 6
                            = 1,711.4074

     In the numerator the mean for the zero level of nitrogen
     is multiplied by 2 to give it equal weight with the mean
     of levels 1 and 2 with which it is compared. The 9 is
     the number of plots on which each mean is based. The
     (2² + 1² + 1²) in the denominator is the sum of squares
     of the coefficients used in the numerator.

100 vs. 200 pounds SS = 9(N̄₁ − N̄₂)² / (1² + 1²)
                      = 9(20.5556 − 18.7778)² / 2 = 14.2222





Note that these two sums of squares (1,711.4074 and 14.2222), each 
with 1 df, add up to the sum of squares for nitrogen (1,725.6296) with 
2 df's. This additive characteristic holds true only if the individual df
comparisons selected are orthogonal (i.e., independent). When the num- 
ber of observations is the same for all treatments, the orthogonality of any 
two comparisons can be checked in the following manner: First, tabulate 
the coefficients and check to see that for each comparison the coefficients 
sum to zero. 





                                 Nitrogen level
Comparison                       0      1      2      Sum
2N₀ vs. N₁ + N₂                 +2     −1     −1       0
N₁ vs. N₂                        0     +1     −1       0

Product of coefficients          0     −1     +1       0













Then for two comparisons to be orthogonal the sum of the products of
corresponding coefficients must be zero. Any sum of squares can be
partitioned in a similar manner, with the number of possible orthogonal
individual df comparisons being equal to the total number of degrees of
freedom with which the sum of squares is associated.
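
The single-df sums of squares and the orthogonality check lend themselves to a short illustration. This is a minimal Python sketch; the function names `contrast_ss` and `is_orthogonal` are hypothetical, and the sum of squares is written in terms of treatment totals, which is algebraically the same as the form based on means when every total rests on the same number of plots.

```python
def contrast_ss(coefficients, totals, plots_per_total):
    """Single-df sum of squares for a comparison among equally replicated totals."""
    comparison = sum(c * t for c, t in zip(coefficients, totals))
    return comparison ** 2 / (plots_per_total * sum(c * c for c in coefficients))


def is_orthogonal(first, second):
    """Both sets of coefficients sum to zero and so does the sum of their products."""
    products = [a * b for a, b in zip(first, second)]
    return sum(first) == 0 and sum(second) == 0 and sum(products) == 0


nitrogen_totals = [329, 185, 169]   # levels 0, 1, 2; each total is based on 9 plots

none_vs_some = [2, -1, -1]          # 2N0 vs. N1 + N2
low_vs_high = [0, 1, -1]            # N1 vs. N2

ss_none_vs_some = contrast_ss(none_vs_some, nitrogen_totals, 9)
ss_low_vs_high = contrast_ss(low_vs_high, nitrogen_totals, 9)

print(round(ss_none_vs_some, 4), round(ss_low_vs_high, 4))
print(is_orthogonal(none_vs_some, low_vs_high))
```

For orthogonal comparisons the two results add up to the nitrogen sum of squares, which gives a quick check on the arithmetic.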

The sum of squares for species can also be partitioned into two orthog- 
onal single df comparisons. If the comparisons were specified before the 
data were examined, we might make single df tests of the difference be- 
tween B and the average of A and C and also of the difference between A 
and C. The method is the same as that illustrated in the comparison of 
nitrogen treatments. The calculations are as follows: 


2B vs. (A + C) SS = [240 + 241 − 2(202)]² / [9(1² + 1² + 2²)]
                  = 77²/54 = 109.7963

A vs. C SS = (240 − 241)² / [9(1² + 1²)] = 1/18 = 0.0556

















These comparisons are orthogonal, so that the sums of squares each 
with one df add up to the species SS with two df’s. 

Note that in computing the sums of squares for the single-degree-of- 
freedom comparisons, the equations have been restated in terms of treat- 
ment totals rather than means. This often simplifies the computations 
and reduces the errors due to rounding. 

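With the partitioning the analysis has become:

Source                              df            SS            MS          F
Blocks                               2       11.6296        5.8148
Species                              2      109.8518       54.9260      2.992NS
   2B vs. (A + C)                    1      109.7963      109.7963      5.981*
   A vs. C                           1        0.0556        0.0556      0.003NS
Nitrogen                             2    1,725.6296      862.8148     47.003**
   2N₀ vs. (N₁ + N₂)                 1    1,711.4074    1,711.4074     93.232**
   N₁ vs. N₂                         1       14.2222       14.2222      0.775NS
Species × nitrogen interaction       4      134.8149       33.7037      1.836NS
Error                               16      293.7037       18.3565
Total                               26    2,275.6296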











We conclude that species B is poorer than A or C and that there is no 
difference in growth between A and C. We also conclude that nitrogen 
adversely affected growth and that 100 pounds was about as bad as 200 
pounds. The nitrogen effect was about the same for all species (i.e., no
interaction). 

It is worth repeating that the comparisons to be made in an analysis
should, whenever possible, be planned and specified prior to an examina-
tion of the data. A good procedure is to outline the analysis, putting in
all the items that are to appear in the first two columns (source and df)
of the table. In the above tabulation, last column, * means significant
at the 0.05 level. As in the previous table, ** means significant at the
0.01 level, and NS means nonsignificant.

The factorial experiment, it will be noted, is not an experimental
design. It is, instead, a way of selecting treatments; given two or more
factors each at two or more levels, the treatments are all possible com-
binations of the levels of each factor. If we have three factors with the
first at four levels, the second at two levels, and the third at three levels,
we will have 4 × 2 × 3 = 24 factorial combinations or treatments. Fac-
torial experiments may be conducted in any of the standard designs. The
randomized block and split plot design are the most common for factorial
experiments in forest research.


The Split Plot Design 


When two or more types of treatment are applied in factorial combina- 
tions, it may be that one type can be applied on relatively small plots 
while the other type is best applied to larger plots. Rather than make all 
plots of the size needed for the second type, a split-plot design can be 
employed. In this design, the major (large-plot) treatments are applied 
to a number of plots with replication accomplished through any of the 
common designs (such as complete randomization, randomized blocks, 
Latin square). Each major plot is then split into a number of subplots,
equal to the number of minor (small-plot) treatments. Minor treatments 
are assigned at random to subplots within each major plot. 

As an example, a test was to be made of direct seeding of loblolly pine
at six different dates, on burned and unburned seedbeds. To get typical
burn effects, major plots 6 acres in size were selected. There were to be
four replications of major treatments in randomized blocks. Each major
plot was divided into six 1-acre subplots for seeding at six dates. The field
layout was somewhat as follows (blocks denoted by Roman numerals,
burning treatment by capital letters, date of seeding by small letters):

[Figure: field layout of Blocks I to IV, each containing major plots A and B, with each major plot split into six subplots a to f.]



















One pound of seed was sowed on each 1-acre subplot. Seedling counts
were made at the end of the first growing season. Results were as follows: 


Summary of seedlings per acre

                                  Blocks                                 Date
               I             II            III            IV          subtotals       Date
Date       A      B      A      B      A      B      A      B        A       B      totals
a         900    880    810  1,100    760    960  1,040  1,040    3,510   3,980     7,490
b         880  1,050  1,170  1,240  1,060  1,110    910  1,120    4,020   4,520     8,540
c       1,530  1,140  1,160  1,270  1,390  1,320  1,540  1,080    5,620   4,810    10,430
d       1,970  1,360  1,890  1,510  1,820  1,490  2,140  1,270    7,820   5,630    13,450
e       1,960  1,270  1,670  1,380  1,310  1,500  1,480  1,450    6,420   5,600    12,020
f         830    150    420    380    570    420    760    270    2,580   1,220     3,800

Major
plot
totals  8,070  5,850  7,120  6,880  6,910  6,800  7,870  6,230   29,970  25,760    55,730

Block
totals     13,920        14,000        13,710        14,100                        55,730





























Calculations. The correction term and total sum of squares are
calculated using the 48 subplot values.

C.T. = (Grand total of all subplots)²/(Total number of subplots)
     = 55,730²/48 = 64,704,852

Total SS = Σ(Subplot values²) − C.T.
         = (900² + 880² + ... + 270²) − C.T. = 9,339,648


Before partitioning the total sum of squares into its components, it may 
be instructive to ignore subplots for the moment, and examine the major 
plot phase of the study. The major phase can be viewed as a straight 
randomized block design with two burning treatments in each of four 
blocks. The analysis would be: 


Source df 
Blocks 3 
Burning 1 
Error (major plots) 3 
Major plots 7 


Now, looking at the subplots, we can think of the major plots as blocks.
From this standpoint we would have a randomized block design with six
dates of treatment in each of eight blocks (major plots) for which the
analysis is:

Source                  df
Major plots              7
Dates                    5
Remainder               35
Subplots (= Total)      47







In this analysis, the remainder is made up of two components. One of
these is the burning-date interaction, with five df's. The rest, with 30
df's, is called the subplot error. Thus, the complete breakdown of the
split-plot design is:

Source                  df
Blocks                   3 }
Burning                  1 }  Major plots: 7 df
Major-plot error         3 }
Dates                    5    Dates: 5 df
Burning × date           5 }
Subplot error           30 }  Remainder: 35 df
Total                   47


The various sums of squares are obtained in an analogous manner. We
first compute

Major plot SS = Σ(Major plot totals²)/(Number of subplots per major plot) − C.T.
              = (8,070² + ... + 6,230²)/6 − C.T. = 647,498

Block SS = Σ(Block totals²)/(Subplots per block) − C.T.
         = (13,920² + ... + 14,100²)/12 − C.T. = 6,856

Burning SS = Σ(Burning treatment totals²)/(Subplots per burning treatment) − C.T.
           = (29,970² + 25,760²)/24 − C.T. = 369,252

Major-plot error SS = Major plot SS − Block SS − Burning SS = 271,390
      (3 df)             (7 df)        (3 df)     (1 df)

Subplot SS = Total SS − Major plot SS = 8,692,150
  (40 df)     (47 df)      (7 df)

Date SS = Σ(Date totals²)/(Subplots per date) − C.T.
        = (7,490² + ... + 3,800²)/8 − C.T. = 7,500,086


С.Т. 


Date-burning SS:
To get the sum of squares for the interaction between date and burning 
we resort to a factorial experiment device—the two-way table of the 
treatment combination totals. 










                                     Date                            Burning
Burning         a       b        c        d        e       f       subtotals
A            3,510   4,020    5,620    7,820    6,420   2,580        29,970
B            3,980   4,520    4,810    5,630    5,600   1,220        25,760

Date
subtotals    7,490   8,540   10,430   13,450   12,020   3,800        55,730


Date-burning subclass SS
     = Σ(Date-burning combination totals²)/(Subplots per date-burning combination) − C.T.
     = (3,510² + ... + 1,220²)/4 − C.T. = 8,555,723

Date-burning interaction SS = Date-burning subclass SS − Date SS − Burning SS
                            = 686,385

Subplot error SS = Subplot SS − Date SS − Date-burning interaction SS = 505,679
    (30 df)         (40 df)     (5 df)            (5 df)

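
Thus the completed analysis table is

Source                        df           SS            MS
Blocks                         3        6,856         2,285
Burning                        1      369,252       369,252
Major-plot error               3      271,390        90,463
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Date                           5    7,500,086     1,500,017
Date-burning interaction       5      686,385       137,277
Subplot error                 30      505,679        16,856
Total                         47    9,339,648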











The F test for burning is

F (1/3 df) = Burning MS/Major-plot error MS = 369,252/90,463 = 4.082,
not significant at the 0.05 level.

For dates,

F (5/30 df) = Date MS/Subplot error MS = 1,500,017/16,856 = 88.99,
significant at the 0.01 level.

And for the date-burning interaction,

F (5/30 df) = Date-burning MS/Subplot error MS = 137,277/16,856 = 8.14,
significant at the 0.01 level.





Note that the major-plot error is used to test the sources above the
dashed line while the subplot error is used for the sources below the line.
Because the subplot error is a measure of random variation within major
plots it will usually be smaller than the major-plot error, which is a
measure of the random variation between major plots. In addition to being
smaller, the subplot error will generally have more degrees of freedom
than the major-plot error, and for these reasons the sources below the
dashed line will usually be tested with greater sensitivity than the sources
above the line. This fact is important; in planning a split-plot experi-
ment the designer should try to get the items of greatest interest below
the line rather than above.
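
The whole split-plot analysis can be retraced from the 48 subplot values. This is a minimal Python sketch, not a general-purpose routine: the nested layout of `data` and every variable name are illustrative choices, and the degrees of freedom are hard-coded for this example (4 blocks, 2 burning treatments, 6 dates).

```python
# data[date] lists, for Blocks I to IV, the (A, B) seedling counts per acre.
data = {
    "a": [(900, 880), (810, 1100), (760, 960), (1040, 1040)],
    "b": [(880, 1050), (1170, 1240), (1060, 1110), (910, 1120)],
    "c": [(1530, 1140), (1160, 1270), (1390, 1320), (1540, 1080)],
    "d": [(1970, 1360), (1890, 1510), (1820, 1490), (2140, 1270)],
    "e": [(1960, 1270), (1670, 1380), (1310, 1500), (1480, 1450)],
    "f": [(830, 150), (420, 380), (570, 420), (760, 270)],
}
blocks, burns = range(4), range(2)

values = [x for rows in data.values() for pair in rows for x in pair]
correction_term = sum(values) ** 2 / len(values)


def ss_from_totals(totals, subplots_per_total):
    """Sum of squares from totals that each cover the same number of subplots."""
    return sum(t * t for t in totals) / subplots_per_total - correction_term


major_totals = [sum(data[d][blk][k] for d in data) for blk in blocks for k in burns]
block_totals = [sum(sum(data[d][blk]) for d in data) for blk in blocks]
burn_totals = [sum(data[d][blk][k] for d in data for blk in blocks) for k in burns]
date_totals = [sum(sum(pair) for pair in data[d]) for d in data]
combo_totals = [sum(data[d][blk][k] for blk in blocks) for d in data for k in burns]

total_ss = sum(x * x for x in values) - correction_term
major_ss = ss_from_totals(major_totals, 6)
block_ss = ss_from_totals(block_totals, 12)
burn_ss = ss_from_totals(burn_totals, 24)
major_error_ss = major_ss - block_ss - burn_ss
date_ss = ss_from_totals(date_totals, 8)
interaction_ss = ss_from_totals(combo_totals, 4) - date_ss - burn_ss
subplot_error_ss = total_ss - major_ss - date_ss - interaction_ss

major_error_ms = major_error_ss / 3
subplot_error_ms = subplot_error_ss / 30

# Burning is tested against the major-plot error; dates and the
# date-burning interaction are tested against the subplot error.
f_burning = (burn_ss / 1) / major_error_ms
f_dates = (date_ss / 5) / subplot_error_ms
f_interaction = (interaction_ss / 5) / subplot_error_ms

print(round(f_burning, 3), round(f_dates, 2), round(f_interaction, 2))
```

Small differences in the last digit of a sum of squares, compared with the hand calculation, come from rounding the correction term to a whole number in the worked example.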

Rarely will the major-plot error be appreciably smaller than the sub-
plot error. If it is, the conduct of the study and the computations should
be carefully examined.

Subplots can also be split. If desired, the subplots can also be split for
a third level of treatment, producing a split-split-plot design. The
calculations follow the same general pattern but are more involved. A
split-split-plot design has three separate error terms.

Comparisons among means in a split-plot design. For comparisons
among major- or subplot treatments, F tests with a single degree of free-
dom may be made in the usual manner. Comparisons among major-plot
treatments should be tested against the major-plot error mean square,
while subplot treatment comparisons are tested against the subplot error.
In addition, it is sometimes desirable to compare the means of two treat-
ment combinations. This can get tricky, for the variation among such
means may contain more than one source of error. A few of the more
common cases are discussed below.

In general, the t test for comparing two equally replicated treatment
means is

     t = Mean difference/Standard error of the mean difference = D̄/s_D̄

1. For the difference between two major-treatment means:

     s_D̄ = √[2(Major-plot error MS)/((m)(R))]; t has df equal to the df for
     the major-plot error.

     where: R = Number of replications of major treatments.
            m = Number of subplots per major plot.

2. For the difference between two minor-treatment means:

     s_D̄ = √[2(Subplot error MS)/((R)(M))]; t has df equal to the df for
     the subplot error.

     where: M = Number of major-plot treatments.

3. For the difference between two minor treatments within a single
   major treatment:

     s_D̄ = √[2(Subplot error MS)/R]; df for t = df for the subplot error.

4. For the difference between the means of two major treatments at
   a single level of a minor treatment, or between the means of two
   major treatments at different levels of a minor treatment:

     s_D̄ = √{2[(m − 1)(Subplot error MS) + (Major-plot error MS)]/((m)(R))}

   In this case, t will not follow the t distribution. A close
   approximation to the value of t required for significance
   at the α level is given by

     t' = [(m − 1)(Subplot error MS)(t_m) + (Major-plot error MS)(t_M)]
          / [(m − 1)(Subplot error MS) + (Major-plot error MS)]

   where:

     t_m = Tabular value of t at the α level for df equal to the df for the
           subplot error.

     t_M = Tabular value of t at the α level for df equal to the df for the
           major-plot error.

   Other symbols are as previously defined.
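
To make the four cases concrete, they can be applied to the direct-seeding example. This is an illustrative Python sketch only: the handbook does not work these numbers, the variable names are hypothetical, and the tabular t values (2.042 for 30 df and 3.182 for 3 df at the 0.05 level) are standard two-tailed table entries supplied here as an assumption about which level one would test at.

```python
import math

major_error_ms = 90463      # major-plot error MS from the seeding example
subplot_error_ms = 16856    # subplot error MS from the seeding example
R = 4                       # replications of major treatments (blocks)
M = 2                       # major-plot treatments (burned, unburned)
m = 6                       # subplots per major plot (seeding dates)

# Case 1: two major-treatment means.
se_major = math.sqrt(2 * major_error_ms / (m * R))

# Case 2: two minor-treatment means.
se_minor = math.sqrt(2 * subplot_error_ms / (R * M))

# Case 3: two minor treatments within a single major treatment.
se_minor_within_major = math.sqrt(2 * subplot_error_ms / R)

# Case 4: two major treatments at the same or different minor levels.
weighted_subplot = (m - 1) * subplot_error_ms
se_major_at_minor = math.sqrt(2 * (weighted_subplot + major_error_ms) / (m * R))

# Approximate t required for significance in case 4.
t_subplot = 2.042   # tabular t, 0.05 level, 30 df (assumed level)
t_major = 3.182     # tabular t, 0.05 level, 3 df (assumed level)
t_required = ((weighted_subplot * t_subplot + major_error_ms * t_major)
              / (weighted_subplot + major_error_ms))

print(round(se_major, 1), round(se_minor, 1),
      round(se_minor_within_major, 1), round(se_major_at_minor, 1))
print(round(t_required, 3))
```

The required t in case 4 is a weighted average of the two tabular values, so it always falls between them.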


Missing Plots 


A mathematician who had developed a complex electronic computer
program for analyzing a wide variety of experimental designs was asked
how he handled missing plots. His disdainful reply was, "We tell our
research workers not to have missing plots."

This is good advice. But it is sometimes hard to follow, and particu- 
larly so in forest research, where close control over experimental material 
is difficult and studies may run for several years.

The likelihood of plots being lost during the course of a study should be
considered when selecting an experimental design. Lost plots are least
troublesome in the simple designs. For this reason, complete randomiza-
tion and randomized blocks may be preferable to the more intricate
designs when missing data can be expected.

In the complete randomization design, loss of one or more plots causes
no computational difficulties. The analysis is made as though the
missing plots never existed. Of course, a degree of freedom will be lost
from the total and error terms for each missing plot and the sensitivity of
the test will be reduced. If missing plots are likely, the number of
replications should be increased accordingly.

In the randomized block design, completion of the analysis will usually
require an estimate of the values for the missing plots. A single missing
value can be estimated by

     X = (bB + tT − G)/[(b − 1)(t − 1)]

where: b = Number of blocks
       t = Number of treatments
       B = Total of all other units in the block with a missing plot
       T = Total of all other units that received the same treatment as
           the missing plot
       G = Total of all observed units

If more than one plot is missing, the customary procedure is to insert
guessed values for all but one of the missing units, which is then estimated
by the above formula. This estimate is used in obtaining an estimated
value for one of the guessed plots, and so on through each missing unit.
Then the process is repeated with the first estimates replacing the guessed
values. The cycle should be repeated until the new approximations differ
little from the previous estimates.

The estimated values are now applied in the usual analysis-of-variance
calculations. For each missing unit one degree of freedom is deducted
from the total and from the error term.
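
The randomized block formula for a single missing plot can be sketched in a few lines. This is a minimal Python illustration on made-up data (three blocks by three treatments, with the missing unit marked `None`); the function name `estimate_missing_plot` and the data are hypothetical, not taken from the handbook.

```python
def estimate_missing_plot(plots, block, treatment):
    """Estimate one missing value in a randomized block layout.

    plots[i][j] is the value for block i and treatment j; the missing
    unit is stored as None and is skipped in every total.
    """
    b = len(plots)          # number of blocks
    t = len(plots[0])       # number of treatments
    block_total = sum(v for v in plots[block] if v is not None)
    treatment_total = sum(row[treatment] for row in plots
                          if row[treatment] is not None)
    grand_total = sum(v for row in plots for v in row if v is not None)
    return (b * block_total + t * treatment_total - grand_total) / ((b - 1) * (t - 1))


# Hypothetical data built as block effect + treatment effect, so the
# value removed from block 2, treatment 3 is known to have been 17.
plots = [
    [10, 13, 15],
    [12, 15, None],
    [15, 18, 20],
]

estimate = estimate_missing_plot(plots, block=1, treatment=2)
print(estimate)
```

Because the made-up data are perfectly additive, the estimate reproduces the removed value exactly; with real data it will only approximate it. For several missing plots the same function could be called in turn inside the iterative cycle described above.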







A similar procedure is used with the Latin square design, but the formu-
la for a missing plot is

     X = [r(R + C + T) − 2G]/[(r − 1)(r − 2)]

where: r = Number of rows
       R = Total of all observed units in the row with the missing plot
       C = Total of all observed units in the column with the missing plot
       T = Total of all observed units in the missing plot treatment
       G = Grand total of all observed units


With the split-plot design, missing plots can cause trouble. A single
missing subplot value can be estimated by the equation

     X = [rP + m(T_ij) − (T_i)]/[(r − 1)(m − 1)]

where: r = Number of replications of major-plot treatments
       P = Total of all observed subplots in the major plot having a
           missing subplot
       m = Number of subplot treatments
       T_ij = Total of all subplots having the same treatment combination
              as the missing unit
       T_i = Total of all subplots having the same major-plot treatment
             as the missing unit


For more than one missing subplot the iterative process described for 
randomized blocks must be used. In the analysis, one df will be deducted 
from the total and subplot error terms for each missing subplot. 

When data for missing plots are estimated, the treatment mean square 
for all designs is biased upwards. If the proportion of missing plots is 
small, the bias can usually be ignored. Where the proportion is large, 
adjustments can be made as described in the standard references on ex-
perimental designs.


REGRESSION 





Simple Linear Regression


A forester had an idea that he could tell how well a loblolly pine was
growing from the volume of the crown. Very simple: big crown, good
growth; small crown, poor growth. But he couldn't say how big and
how good, or how small and how poor. What he needed was regression
analysis: it would enable him to express a relationship between tree growth
and crown volume in an equation. Given a certain crown volume, he
could use the equation to predict what the tree growth was.

To gather data, he ran parallel survey lines across a large tract that was 
representative of the area in which he was interested. The lines were 5 
chains apart. At each 2-chain mark along the lines, he measured the 
nearest loblolly pine of at least 5.6 inches d.b.h. for crown volume and 
basal area growth over the past 10 years. 

A portion of the data is printed below to illustrate the methods of 
calculation. Crown volume in hundreds of cubic feet is labeled X and 
basal area growth in square feet is labeled Y. Now, what can we tell the 
forester about the relationship? 




























x x 
Crown Ү Crown Y Crown 
volume Growth volume Growth volume Growth 

51 Al 
75 66 

6 AB 
20 21 
36 29 
50 -56 

9 13 

2 10 
21 18 
17 17 
87 68 
97 86 
33 AB 
20 06 
96 58 
61 42 

Sums             3,050      26.62
Means (n = 62)   49.1935    0.42935














Often, the first step is to plot the field data on coordinate paper (fig. 1).
This is done to provide some visual evidence of whether the two variables
are related. If there is a simple relationship, the plotted points will tend
to form a pattern (a straight line or curve). If the relationship is very
strong, the pattern will generally be distinct. If the relationship is weak,
the points will be more spread out and the pattern less definite. If the
points appear to fall pretty much at random, there may be no simple
relationship or one that is so very poor as to make it a waste of time to fit
any regression.

The type of pattern (straight line, parabolic curve, exponential curve, 
etc.) will influence the regression model to be fitted. In this particular 
case, we will assume a simple straight-line relationship. 

After selecting the model to be fitted, the next step will be to calculate 
the corrected sums of squares and products. In the following equations, 
capital letters indicate uncorrected values of the variables; lower-case 
letters will be used for the corrected values (y = Y − Ȳ).


The corrected sum of squares for Y: Σy² = ΣY² − (ΣY)²/n
     = (0.36² + 0.09² + ... + 0.42²) − 26.62²/62
     = 2.7826

The corrected sum of squares for X: Σx² = ΣX² − (ΣX)²/n
     = (22² + 6² + ... + 61²) − 3,050²/62
     = 59,397.6775

The corrected sum of products: Σxy = Σ(XY) − (ΣX)(ΣY)/n
     = [(22)(0.36) + (6)(0.09) + ... + (61)(0.42)] − (3,050)(26.62)/62
     = 354.1477








The general form of equation for a straight line is Y = a + bX.

In this equation, a and b are constants or regression coefficients that must
be estimated. According to the principle of least squares, the best esti-
mates of these coefficients are:

     b = Σxy/Σx² = 354.1477/59,397.6775 = 0.005962

     a = Ȳ − bX̄ = 0.42935 − (0.005962)(49.1935) = 0.13606

Substituting these estimates in the general equation gives

     Ŷ = 0.13606 + 0.005962X

where Ŷ is used to indicate that we are dealing with an estimated value of Y.

[Figure 1. Plotting of growth (Y) over crown volume (X).]







With this equation we can estimate the basal area growth for the past
10 years (Ŷ) from the measurements of the crown volume X.
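
The least-squares estimates can be reproduced from the corrected sums alone. This is a minimal Python sketch that starts from the sums quoted in the text rather than from the 62 individual trees; the names `sxx`, `sxy`, and `predict_growth` are illustrative, and the crown volume of 50 used in the example call is an arbitrary value.

```python
n = 62
sum_x = 3050          # total crown volume, hundreds of cubic feet
sum_y = 26.62         # total basal area growth, square feet
sxx = 59397.6775      # corrected sum of squares for X
sxy = 354.1477        # corrected sum of products

# Least-squares slope and intercept.
b = sxy / sxx
a = sum_y / n - b * sum_x / n


def predict_growth(crown_volume):
    """Estimated 10-year basal area growth for a given crown volume."""
    return a + b * crown_volume


print(round(b, 6), round(a, 4))
print(round(predict_growth(50), 3))
```

The intercept printed here may differ from the hand-computed 0.13606 in the fifth decimal place, because the code does not round b before using it.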

Because Y is estimated from a known value of X, it is called the de- 
pendent variable and X the independent variable. In plotting on graph 
paper, the values of Y are usually (purely by convention) plotted along 
the vertical axis (ordinate) and the values of X along the horizontal axis 
(abscissa). 


How Well Does the Regression Line Fit the Data? 


A regression line can be thought of as a moving average. It gives an
average value of Y associated with a particular value of X. Of course,
some values of Y will be above the regression line (or moving average) and
some below, just as some values of Y are above or below the general
average of Y.

The corrected sum of squares for Y (i.e., Σy²) estimates the amount of
variation of individual values of Y about the mean value of Y. A regres-
sion equation is a statement that part of the observed variation in Y
(estimated by Σy²) is associated with the relationship of Y to X. The
amount of variation in Y that is associated with the regression on X is
called the reduction or regression sum of squares.


Reduction SS = (Σxy)²/Σx² = (354.1477)²/59,397.6775 = 2.1115

As noted above, the total variation in Y is estimated by Σy² = 2.7826
(as previously calculated).

The part of the total variation in Y that is not associated with the re-
gression is called the residual sum of squares. It is calculated by

Residual SS = Σy² − Reduction SS = 2.7826 − 2.1115 = 0.6711.


In analysis of variance we used the unexplained variation as a standard
for testing the amount of variation attributable to treatments. We can
do the same in regression. What's more, the familiar F test will serve.

Source of variation               df ¹      SS        MS ²
Due to regression                  1      2.1115    2.1115
Residual (i.e., unexplained)      60       .6711     .01118
Total (= $\sum y^2$)                   61      2.7826

¹ As there are 62 values of Y, the total sum of squares has 61 df.
The regression of Y on X has one df. The residual df are obtained by
subtraction.
² MS is, as always, equal to SS/df.

The regression is tested by

$$F = \frac{\text{Regression MS}}{\text{Residual MS}} = \frac{2.1115}{0.01118} = 188.86$$

As the calculated F is much greater than tabular $F_{0.01}$ with 1/60 df, the
regression is deemed significant at the 0.01 level.
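The analysis of variance for a simple linear regression can be sketched as follows. This is only an illustration under the stated inputs: the function name `regression_anova` is an assumption, and the three sums are the values quoted in the text. Because intermediate values are not rounded, the F ratio differs slightly from the printed one.

```python
def regression_anova(sum_xy, sum_x2, sum_y2, n):
    """Split the total SS of Y into regression and residual parts.

    sum_xy, sum_x2, sum_y2 are corrected sums; n is the number of
    observations. Returns (reduction_ss, residual_ss, residual_ms, f_ratio).
    """
    reduction_ss = sum_xy ** 2 / sum_x2
    residual_ss = sum_y2 - reduction_ss
    residual_df = n - 2                  # total df (n - 1) minus 1 for regression
    residual_ms = residual_ss / residual_df
    f_ratio = reduction_ss / residual_ms  # regression MS has 1 df
    return reduction_ss, residual_ss, residual_ms, f_ratio


reduction_ss, residual_ss, residual_ms, f_ratio = regression_anova(
    354.1477, 59397.6775, 2.7826, 62)
print(f"Reduction SS = {reduction_ss:.4f}")
print(f"Residual SS  = {residual_ss:.4f}")
print(f"F            = {f_ratio:.2f} with 1/60 df")
```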

Before we fitted a regression line to the data, Y had a certain amount of 
variation about its mean (Y). Fitting the regression was, in effect, an 
attempt to explain part of this variation by the linear association of Y 
with X. But even after the line had been fitted, some variation was 
unexplained—that of Y about the regression line. When we tested the 
regression line above, we merely showed that the part of the variation in 
Y that is explained by the fitted line is significantly greater than the part 
that the line left unexplained. The test did not show that the line we 
fitted gives the best possible description of the data (a curved line might 
be even better). Nor does it mean that we have found the true mathe- 
matical relationship between the two variables. There is a dangerous 
tendency to ascribe more meaning to a fitted regression than is warranted. 

It might be noted that the residual sum of squares is equal to the sum of 
the squared deviations of the observed values of Y from the regression 
line. That is, 


$$\text{Residual SS} = \sum (Y - \hat{Y})^2 = \sum (Y - a - bX)^2$$


The principle of least squares says that the best estimates of the regression
coefficients (a and b) are those that make this sum of squares a minimum.


Coefficient of Determination 


As a measure of how well a regression fits the sample data, we can
compute the proportion of the total variation in Y that is associated with
the regression. This ratio is sometimes called the coefficient of
determination.

$$\text{Coefficient of determination} = \frac{\text{Reduction SS}}{\text{Total SS}} = \frac{2.1115}{2.7826} = 0.758823$$

When someone says, “76 percent of the variation in Y was associated
with X,” he means that the coefficient of determination was 0.76.
The coefficient of determination is equal to the square of the correlation
coefficient.

$$\text{Coefficient of determination} = \frac{\text{Reduction SS}}{\text{Total SS}} = \frac{(\sum xy)^2 / \sum x^2}{\sum y^2} = \frac{(\sum xy)^2}{(\sum x^2)(\sum y^2)} = r^2$$

In fact, most present-day users of regression refer to $r^2$ values rather
than to coefficients of determination.
In the older literature, $1 - r^2$ is sometimes called the coefficient of
nondetermination, and $\sqrt{1 - r^2}$ has been called the alienation index.





Confidence Intervals 
Since it is based on sample data, a regression equation is subject to
sample variation. Confidence limits on the regression line can be obtained
by specifying several values over the range of X and computing

$$\text{Confidence limits} = \hat{Y} \pm t\sqrt{(\text{Residual MS})\left(\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum x^2}\right)}$$

where $X_0$ = a selected value of X, and
degrees of freedom for t = df for residual MS.
In the example we had:

$\hat{Y} = 0.13606 + 0.005962X$
Residual MS = 0.01118 with 60 df
n = 62
$\bar{X} = 49.1935$
$\sum x^2 = 59{,}397.6775$

So, if we pick $X_0 = 28$ we have $\hat{Y} = 0.303$, and
95-percent confidence limits

$$= 0.303 \pm 2.000\sqrt{0.01118\left(\frac{1}{62} + \frac{(28 - 49.1935)^2}{59{,}397.6775}\right)}$$

= 0.270 to 0.336


For other values of $X_0$ we would get:

                          95-percent limits
    X_0        Ŷ         Lower      Upper
    8         0.184      0.139      0.229
   49.1935     .429       .402       .456
   70          .553       .521       .585
   90          .673       .629       .717

In figure 2 these points have been plotted and connected by smooth curves. 
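The limits in the table can be reproduced with a short sketch like the one below. The function name `mean_limits` is an assumption for illustration, and the t value of 2.000 (60 df, 0.05 level) is taken from the worked example rather than computed. Since the code carries unrounded values, the third decimal may differ by one unit from the printed table.

```python
import math


def mean_limits(x0, a, b, resid_ms, n, mean_x, sum_x2, t):
    """Confidence limits on the regression line (mean Y) at X = x0."""
    y_hat = a + b * x0
    half_width = t * math.sqrt(
        resid_ms * (1.0 / n + (x0 - mean_x) ** 2 / sum_x2))
    return y_hat, y_hat - half_width, y_hat + half_width


# Values from the worked example in the text.
params = dict(a=0.13606, b=0.005962, resid_ms=0.01118, n=62,
              mean_x=49.1935, sum_x2=59397.6775, t=2.000)

table = {x0: mean_limits(x0, **params) for x0 in (8, 28, 49.1935, 70, 90)}
for x0, (y_hat, lower, upper) in table.items():
    print(f"X0 = {x0:>8}  Y = {y_hat:.3f}  limits {lower:.3f} to {upper:.3f}")
```

Note how the interval is narrowest at the mean of X and widens toward either end of the range, which is why the plotted limits are curves rather than straight lines.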
FIGURE 2.—Confidence limits for the regression of Y on X.




It should be especially noted that these are confidence limits on the
regression of Y on X. They indicate the limits within which the true
mean of Y for a given X will lie unless a one-in-twenty chance has occurred.
The limits do not apply to a single predicted value of Y. The limits
within which a single Y might lie are given by

$$\hat{Y} \pm t\sqrt{(\text{Residual MS})\left(1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum x^2}\right)}$$





Assumptions.—In addition to assuming that the relationship of Y to X
is linear, the above method of fitting assumes that the variance of Y about
the regression line is the same at all levels of X (the assumption of
homogeneous variance or homoscedasticity, if you want to impress your
friends). The fitting does not assume nor does it require that the variation
of Y about the regression line follows the normal distribution. However, the
F test does assume normality, and so does the use of t for the computation
of confidence limits.

There is also an assumption of independence of the errors (departures
from regression) of the sample observations. The validity of this
assumption is best insured by selecting the sample units at random. The
requirement of independence may not be met if successive observations are
made on a single unit or if the units are observed in clusters. For example,
a series of observations of tree diameter made by means of a growth band
would probably lack independence.

Selecting the sample units so as to get a particular distribution of the
X values does not violate any of the regression assumptions, provided the
Y values are a random sample of all Y's associated with the selected
values of X. Spreading the sample over a wide range of X values will
usually increase the precision with which the regression coefficients are
estimated. This device must be used with caution, however, for if the Y
values are not random, the regression coefficients and residual mean
squares may be improperly estimated.


Multiple Regression 


It frequently happens that a variable (Y) in which we are interested is
related to more than one independent variable. If this relationship can
be estimated, it may enable us to make more precise predictions of the
dependent variable than would be possible by a simple linear regression.
This brings us up against multiple regression, which is a little more work
but no more complicated than a simple linear regression.

The calculation methods can be illustrated with the following set of
hypothetical data from a study relating the growth of even-aged
loblolly-shortleaf pine stands (Y) to the total basal area ($X_1$), the
percentage of the basal area in loblolly pine ($X_2$), and loblolly pine site
index ($X_3$).


            Y         X1        X2        X3
           65         41        79        75
           78         90        48        83
           85         53        67        74
           50         42        52        61
           55         57        52        59
           59         32        82        73
           82         71        80        72
           66         60        65        66
          113         98        96        99
           86         80        81        90
          104        101        78        86
           92        100        59        88
           96         84        84        93
           65         72        48        70
           81         55        93        85
           77         77        68        71
           83         98        51        84
           97         95        82        81
           90         90        70        78
           87         93        61        89
           74         45        96        81
           70         50        80        77
           75         60        76        70
           75         68        74        76
           93         75        96        85
           76         82        58        80
           71         72        58        68
           61         46        69        65
Sums     2,206      1,987     2,003     2,179
Means  78.7857    70.9643   71.5357   77.8214
(n = 28)














With these data we would like to fit an equation of the form

$$Y = a + b_1X_1 + b_2X_2 + b_3X_3$$

According to the principle of least squares, the best estimates of the X
coefficients can be obtained by solving the set of least squares normal
equations.

$b_1$ equation: $(\sum x_1^2)b_1 + (\sum x_1x_2)b_2 + (\sum x_1x_3)b_3 = \sum x_1y$
$b_2$ equation: $(\sum x_1x_2)b_1 + (\sum x_2^2)b_2 + (\sum x_2x_3)b_3 = \sum x_2y$
$b_3$ equation: $(\sum x_1x_3)b_1 + (\sum x_2x_3)b_2 + (\sum x_3^2)b_3 = \sum x_3y$

where: $\sum x_ix_j = \sum X_iX_j - \dfrac{(\sum X_i)(\sum X_j)}{n}$

Having solved for the X coefficients ($b_1$, $b_2$, and $b_3$), we obtain the
constant term by solving

$$a = \bar{Y} - b_1\bar{X}_1 - b_2\bar{X}_2 - b_3\bar{X}_3$$

Derivation of the least squares normal equations requires a knowledge
of differential calculus. However, for the general linear model with a
constant term

$$Y = a + b_1X_1 + b_2X_2 + \ldots + b_kX_k$$

the normal equations can be written quite mechanically once their pattern
has been recognized. Every term in the first row contains an $x_1$, every
term in the second row an $x_2$, and so forth down to the kth row, every term
of which will have an $x_k$. Similarly, every term in the first column has an
$x_1$ and a $b_1$, every term in the second column has an $x_2$ and a $b_2$, and
so on through the kth column, every term of which has an $x_k$ and a $b_k$. On
the right side of the equations, each term has a y times the x that is
appropriate for a particular row. So, for the general linear model given
above, the normal equations are:

$b_1$ equation: $(\sum x_1^2)b_1 + (\sum x_1x_2)b_2 + (\sum x_1x_3)b_3 + \ldots + (\sum x_1x_k)b_k = \sum x_1y$

$b_2$ equation: $(\sum x_1x_2)b_1 + (\sum x_2^2)b_2 + (\sum x_2x_3)b_3 + \ldots + (\sum x_2x_k)b_k = \sum x_2y$

$b_3$ equation: $(\sum x_1x_3)b_1 + (\sum x_2x_3)b_2 + (\sum x_3^2)b_3 + \ldots + (\sum x_3x_k)b_k = \sum x_3y$

...

$b_k$ equation: $(\sum x_1x_k)b_1 + (\sum x_2x_k)b_2 + (\sum x_3x_k)b_3 + \ldots + (\sum x_k^2)b_k = \sum x_ky$

Given the X coefficients, the constant term can be computed as

$$a = \bar{Y} - b_1\bar{X}_1 - b_2\bar{X}_2 - \ldots - b_k\bar{X}_k$$

Note that the normal equations for the general linear model include the
solution for the simple linear regression

$$(\sum x_1^2)b_1 = \sum x_1y$$

Hence,

$$b_1 = (\sum x_1y) / \sum x_1^2$$

as previously given.
In fact, all of this section on multiple regression can be applied to the
simple linear regression as a special case.
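The corrected sums of squares and products are computed in the
familiar (by now) manner:

$$\sum y^2 = \sum Y^2 - \frac{(\sum Y)^2}{n} = (65^2 + \ldots + 61^2) - \frac{(2206)^2}{28} = 5{,}974.7143$$

$$\sum x_1^2 = \sum X_1^2 - \frac{(\sum X_1)^2}{n} = (41^2 + \ldots + 46^2) - \frac{(1987)^2}{28} = 11{,}436.9643$$

$$\sum x_1y = \sum X_1Y - \frac{(\sum X_1)(\sum Y)}{n} = (41)(65) + \ldots + (46)(61) - \frac{(1987)(2206)}{28} = 6{,}428.7858$$

Similarly,

$\sum x_1x_2 = -1{,}171.4642$        $\sum x_2y = 2{,}632.2143$
$\sum x_1x_3 = 3{,}458.8215$         $\sum x_3^2 = 2{,}606.1072$
$\sum x_2^2 = 5{,}998.9643$          $\sum x_3y = 3{,}327.9286$
$\sum x_2x_3 = 1{,}789.6786$

Putting these values in the normal equations gives:

$11{,}436.9643b_1 - 1{,}171.4642b_2 + 3{,}458.8215b_3 = 6{,}428.7858$
$-1{,}171.4642b_1 + 5{,}998.9643b_2 + 1{,}789.6786b_3 = 2{,}632.2143$
$3{,}458.8215b_1 + 1{,}789.6786b_2 + 2{,}606.1072b_3 = 3{,}327.9286$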


These equations can be solved by any of the standard procedures for
simultaneous equations. One approach (applied to the above equations)
is as follows:

1. Divide through each equation by the numerical coefficient of $b_1$.

$b_1 - 0.102{,}427{,}897b_2 + 0.302{,}424{,}788b_3 = 0.562{,}105{,}960$
$b_1 - 5.120{,}911{,}334b_2 - 1.527{,}727{,}949b_3 = -2.246{,}943{,}867$
$b_1 + 0.517{,}424{,}389b_2 + 0.753{,}466{,}809b_3 = 0.962{,}156{,}792$

2. Subtract the second equation from the first and the third from the
first so as to leave two equations in $b_2$ and $b_3$.

$5.018{,}483{,}437b_2 + 1.830{,}152{,}737b_3 = 2.809{,}049{,}827$
$-0.619{,}852{,}286b_2 - 0.451{,}042{,}021b_3 = -0.400{,}050{,}832$

3. Divide through each equation by the numerical coefficient of $b_2$.

$b_2 + 0.364{,}682{,}430b_3 = 0.559{,}740{,}779$
$b_2 + 0.727{,}660{,}494b_3 = 0.645{,}397{,}042$

4. Subtract the second of these equations from the first, leaving one
equation in $b_3$.

$-0.362{,}978{,}064b_3 = -0.085{,}656{,}263$

5. Solve for $b_3$.

$$b_3 = \frac{-0.085{,}656{,}263}{-0.362{,}978{,}064} = 0.235{,}981{,}927$$

6. Substitute this value of $b_3$ in one of the equations (say the first) of
step 3 and solve for $b_2$.

$b_2 + (0.364{,}682{,}430)(0.235{,}981{,}927) = 0.559{,}740{,}779$
$b_2 = 0.473{,}682{,}316$

7. Substitute the solutions for $b_2$ and $b_3$ in one of the equations (say
the first) of step 1, and solve for $b_1$.

$b_1 - (0.102{,}427{,}897)(0.473{,}682{,}316) + (0.302{,}424{,}788)(0.235{,}981{,}927) = 0.562{,}105{,}960$
$b_1 = 0.539{,}257{,}459$

8. As a check, add up the original normal equations and substitute the
solutions for $b_1$, $b_2$, and $b_3$.

$13{,}724.3216b_1 + 6{,}617.1787b_2 + 7{,}854.6073b_3 = 12{,}388.9287$
$12{,}388.92869 \approx 12{,}388.9287$, check

Given the values of $b_1$, $b_2$, and $b_3$ we can now compute

$$a = \bar{Y} - b_1\bar{X}_1 - b_2\bar{X}_2 - b_3\bar{X}_3 = -11.7320$$

Thus, after rounding of the coefficients, the regression equation is

$$\hat{Y} = -11.732 + 0.539X_1 + 0.474X_2 + 0.236X_3$$

It should be noted that in solving the normal equations more digits have
been carried than would be justified by the rules for number of significant
digits. Unless this is done, the rounding errors may make it difficult to
check the computations.
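For readers with a computer at hand, the hand elimination can be checked with a small solver. The sketch below is not the divide-and-subtract scheme of the text; it uses ordinary Gauss-Jordan elimination with row pivoting, which reaches the same solution. The function name `solve_linear` and the variable names are assumptions made for illustration; the numbers are the corrected sums and means quoted in the text.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * b = rhs by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [list(map(float, row)) + [float(r)] for row, r in zip(matrix, rhs)]
    for col in range(n):
        # Bring the row with the largest entry in this column to the pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Clear this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * pv for v, pv in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


# Corrected sums of squares and products (left side of the normal equations).
sums = [
    [11436.9643, -1171.4642, 3458.8215],
    [-1171.4642, 5998.9643, 1789.6786],
    [3458.8215, 1789.6786, 2606.1072],
]
# Right side of the normal equations: sums of products of each X with Y.
rhs = [6428.7858, 2632.2143, 3327.9286]

b = solve_linear(sums, rhs)

# Constant term from the means of Y, X1, X2, and X3.
mean_y = 78.7857
mean_x = [70.9643, 71.5357, 77.8214]
a = mean_y - sum(bi * mx for bi, mx in zip(b, mean_x))

print("b =", [round(v, 6) for v in b])
print("a =", round(a, 4))
```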




Tests of Significance 


To test the significance of the fitted regression, the outline for the
analysis of variance is

Source                                              df
Reduction due to regression on X1, X2, and X3        3
Residuals                                           24
Total                                               27

The degrees of freedom for the total are equal to the number of
observations minus 1. The total sum of squares is

$$\text{Total SS} = \sum y^2 = 5{,}974.7143$$

The degrees of freedom for the reduction are equal to the number of
independent variables fitted, in this case 3. The reduction sum of squares
for any least squares regression is

$$\text{Reduction SS} = \sum (\text{estimated coefficients})(\text{right side of their normal equations})$$

In this example there are three coefficients estimated by the normal
equations, and so

$$\text{Reduction SS} = b_1(\sum x_1y) + b_2(\sum x_2y) + b_3(\sum x_3y)$$
$$= (0.53926)(6{,}428.7858) + (0.47368)(2{,}632.2143) + (0.23598)(3{,}327.9286)$$
$$= 5{,}498.9389$$

The residual df and sum of squares are obtained by subtraction. Thus
the analysis becomes

Source                               df        SS            MS
Reduction due to X1, X2, and X3       3     5,498.9389    1,832.9796
Residuals                            24       475.7754       19.8240
Total                                27     5,974.7143

To test the regression we compute

$$F_{3/24\,df} = \frac{1{,}832.9796}{19.8240} = 92.46$$

which is significant at the 0.01 level.
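The overall test can be sketched in a few lines. The function name `overall_f` is an illustrative assumption; the coefficients are entered to five decimals, as in the worked example, and the sums are those quoted in the text.

```python
def overall_f(coefs, rhs, total_ss, n):
    """Reduction SS, residual SS, and F for a fitted least squares regression.

    coefs: estimated X coefficients
    rhs:   right sides of their normal equations (sums of products with Y)
    """
    k = len(coefs)
    reduction_ss = sum(b * r for b, r in zip(coefs, rhs))
    residual_ss = total_ss - reduction_ss
    residual_df = n - 1 - k
    f_ratio = (reduction_ss / k) / (residual_ss / residual_df)
    return reduction_ss, residual_ss, residual_df, f_ratio


reduction_ss, residual_ss, residual_df, f_ratio = overall_f(
    coefs=[0.53926, 0.47368, 0.23598],
    rhs=[6428.7858, 2632.2143, 3327.9286],
    total_ss=5974.7143,
    n=28,
)
print(f"Reduction SS = {reduction_ss:.4f}")
print(f"Residual SS  = {residual_ss:.4f} with {residual_df} df")
print(f"F            = {f_ratio:.2f}")
```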

Often we will want to test individual terms of the regression. In the
previous example we might want to test the hypothesis that the true value
of $b_3$ is zero. This would be equivalent to testing whether the variable
$X_3$ makes any contribution to the prediction of Y. If we decide that $b_3$
may be equal to zero, we might rewrite the equation in terms of $X_1$ and
$X_2$. Similarly, we could test the hypothesis that $b_1$ and $b_3$ are both
equal to zero.

To test the contribution of any set of the independent variables in the
presence of the remaining variables:
1. Fit all independent variables and compute the reduction and
residual sums of squares.
2. Fit a new regression that includes only the variables not
being tested. Compute the reduction due to this regression.
3. The reduction obtained in the first step minus the reduction
in the second step is the gain due to the variables being tested.
4. The mean square for the gain (step 3) is tested against the
residual mean square from the first step.


Two examples will illustrate the procedure:
I. Test $X_1$ and $X_2$ in the presence of $X_3$.

1. The reduction due to $X_1$, $X_2$, and $X_3$ is 5,498.9389 with 3 df. The
residual is 475.7754 with 24 degrees of freedom (from previous
example).

2. For fitting $X_3$ alone, the normal equation is

$(\sum x_3^2)b_3 = (\sum x_3y)$
or $2{,}606.1072b_3 = 3{,}327.9286$
$b_3 = 1.27697$

The reduction due to $X_3$ alone is

Red. SS $= b_3(\sum x_3y) = 1.27697(3{,}327.9286)$
$= 4{,}249.6650$ with 1 df.

3. The gain due to $X_1$ and $X_2$ after $X_3$ is

Gain SS = Reduction due to $X_1$, $X_2$, $X_3$ - reduction due to $X_3$ alone
$= 5{,}498.9389 - 4{,}249.6650$
$= 1{,}249.2739$ with (3 - 1) = 2 df.

4. Then

$$F_{2/24\,df} = \frac{\text{Gain MS}}{\text{Residual MS}} = \frac{624.63695}{19.8240} = 31.51, \text{ significant at the 0.01 level.}$$

This test is usually presented in the analysis of variance form:

Source                               df        SS            MS
Reduction due to X1, X2, and X3       3     5,498.9389
Reduction due to X3 alone             1     4,249.6650
Gain due to X1 and X2 after X3        2     1,249.2739     624.63695
Residuals                            24       475.7754      19.8240
Total                                27     5,974.7143

$$F_{2/24\,df} = \frac{624.63695}{19.8240} = 31.51, \text{ significant at the 0.01 level.}$$


II. Test $X_2$ in the presence of $X_1$ and $X_3$.

The normal equations for fitting $X_1$ and $X_3$ are

$(\sum x_1^2)b_1 + (\sum x_1x_3)b_3 = \sum x_1y$
$(\sum x_1x_3)b_1 + (\sum x_3^2)b_3 = \sum x_3y$

or

$11{,}436.9643b_1 + 3{,}458.8215b_3 = 6{,}428.7858$
$3{,}458.8215b_1 + 2{,}606.1072b_3 = 3{,}327.9286$

The solutions are

$b_1 = 0.29387$        $b_3 = 0.88695$

The reduction sum of squares is

Reduction SS $= b_1(\sum x_1y) + b_3(\sum x_3y)$
$= (0.29387)(6{,}428.7858) + (0.88695)(3{,}327.9286)$
$= 4{,}840.9336$, with 2 df.

The analysis is:

Source                               df        SS            MS
Reduction due to X1, X2, and X3       3     5,498.9389
Reduction due to X1 and X3            2     4,840.9336
Gain due to X2 after X1 and X3        1       658.0053     658.0053
Residuals                            24       475.7754      19.8240
Total                                27     5,974.7143

$$F_{1/24\,df} = \frac{658.0053}{19.8240} = 33.19, \text{ significant at the 0.01 level.}$$
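The four-step procedure and both examples can be reproduced with a general sketch such as the following. The helper names (`solve_linear`, `reduction_ss`, `gain_test`) and the zero-based variable indices are assumptions for illustration. The solver is a standard Gauss-Jordan elimination rather than the hand method shown earlier, and unrounded coefficients are used, so results match the printed ones only to about four significant figures.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * b = rhs by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [list(map(float, row)) + [float(r)] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * pv for v, pv in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


def reduction_ss(indices, sums, rhs):
    """Reduction SS for a regression on the variables listed in indices."""
    sub = [[sums[i][j] for j in indices] for i in indices]
    sub_rhs = [rhs[i] for i in indices]
    coefs = solve_linear(sub, sub_rhs)
    return sum(c * r for c, r in zip(coefs, sub_rhs))


def gain_test(tested, kept, sums, rhs, total_ss, n):
    """Gain SS and F for the tested variables in the presence of the kept ones."""
    full = sorted(tested + kept)
    red_full = reduction_ss(full, sums, rhs)          # step 1
    red_kept = reduction_ss(kept, sums, rhs)          # step 2
    gain = red_full - red_kept                        # step 3
    resid_ms = (total_ss - red_full) / (n - 1 - len(full))
    return gain, (gain / len(tested)) / resid_ms      # step 4


sums = [
    [11436.9643, -1171.4642, 3458.8215],
    [-1171.4642, 5998.9643, 1789.6786],
    [3458.8215, 1789.6786, 2606.1072],
]
rhs = [6428.7858, 2632.2143, 3327.9286]

# Index 0 is X1, index 1 is X2, index 2 is X3.
gain_1, f_1 = gain_test([0, 1], [2], sums, rhs, 5974.7143, 28)   # example I
gain_2, f_2 = gain_test([1], [0, 2], sums, rhs, 5974.7143, 28)   # example II
print(f"I.  gain = {gain_1:.2f}, F = {f_1:.2f}")
print(f"II. gain = {gain_2:.2f}, F = {f_2:.2f}")
```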


Coefficient of Multiple Determination 


As a measure of how well the regression fits the data it is customary to
compute the ratio of the reduction sum of squares to the total sum of
squares. This ratio is symbolized by $R^2$ and is sometimes called the
coefficient of determination:

$$R^2 = \frac{\text{Reduction SS}}{\text{Total SS}}$$

For the regression of Y on $X_1$, $X_2$, and $X_3$,

$$R^2 = \frac{5{,}498.9389}{5{,}974.7143} = 0.92$$

The $R^2$ value is usually referred to by saying that a certain percentage
(92 in this case) of the variation in Y is associated with the regression.
The square root (R) of the ratio is called the multiple correlation
coefficient.
The c-multipliers 
Putting confidence limits on a multiple regression requires computation
of the Gauss or c-multipliers. The c-multipliers are the elements of the
inverse of the matrix of corrected sums of squares and products as they
appear in the normal equations. Thus, in fitting the regression of Y on
$X_1$, $X_2$, and $X_3$ the matrix of corrected sums of squares and products is:

$$\begin{pmatrix} \sum x_1^2 & \sum x_1x_2 & \sum x_1x_3 \\ \sum x_1x_2 & \sum x_2^2 & \sum x_2x_3 \\ \sum x_1x_3 & \sum x_2x_3 & \sum x_3^2 \end{pmatrix} = \begin{pmatrix} 11{,}436.9643 & -1{,}171.4642 & 3{,}458.8215 \\ -1{,}171.4642 & 5{,}998.9643 & 1{,}789.6786 \\ 3{,}458.8215 & 1{,}789.6786 & 2{,}606.1072 \end{pmatrix}$$

The inverse of this matrix is:

$$\begin{pmatrix} c_{11} & c_{12} & c_{13} \\ c_{21} & c_{22} & c_{23} \\ c_{31} & c_{32} & c_{33} \end{pmatrix} = \begin{pmatrix} 0.000{,}237{,}573 & 0.000{,}176{,}649 & -0.000{,}436{,}615 \\ 0.000{,}176{,}649 & 0.000{,}340{,}994 & -0.000{,}468{,}616 \\ -0.000{,}436{,}615 & -0.000{,}468{,}616 & 0.001{,}285{,}000 \end{pmatrix}$$

The matrix of c-multipliers is symmetric, and therefore $c_{12} = c_{21}$,
$c_{13} = c_{31}$, etc.

The procedure for calculating the c-multipliers will not be given here.
Those who are interested can refer to one of the standard statistical
textbooks such as Goulden or Snedecor. However, because the c-multipliers
are the output of many electronic computer programs, some of their
applications will be described.
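Although the hand procedure is not given here, the c-multipliers are easy to obtain by machine. The sketch below inverts the matrix of corrected sums by Gauss-Jordan elimination; this is one common method chosen for illustration, not necessarily the one used in the textbooks cited, and the function name `invert` is an assumption.

```python
def invert(matrix):
    """Return the inverse of a square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    # Augment the matrix with the identity matrix on the right.
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * pv for v, pv in zip(aug[r], aug[col])]
    # The right half now holds the inverse.
    return [row[n:] for row in aug]


sums = [
    [11436.9643, -1171.4642, 3458.8215],
    [-1171.4642, 5998.9643, 1789.6786],
    [3458.8215, 1789.6786, 2606.1072],
]
c = invert(sums)
for row in c:
    print("  ".join(f"{v: .9f}" for v in row))
```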

One of the important uses is in the calculation of confidence limits on
the mean value of Y (i.e., regression Y) associated with a specified set of
X values. The general equation for k independent variables is:

$$\text{Confidence limits} = \hat{Y} \pm t\sqrt{(\text{Residual MS})\left(\frac{1}{n} + \sum_i \sum_j c_{ij}(X_i - \bar{X}_i)(X_j - \bar{X}_j)\right)}$$

where: i and j each go from 1 to k
t has df equal to the degrees of freedom
for the residual mean square.

In the example, if we specify $X_1 = 80.9643$, $X_2 = 66.5357$, and
$X_3 = 76.8214$, then $\hat{Y} = 81.576$ and the 95-percent confidence limits are:

$$81.576 \pm 2.064\sqrt{19.8240\Big[\tfrac{1}{28} + c_{11}(80.9643 - \bar{X}_1)^2 + c_{22}(66.5357 - \bar{X}_2)^2 + c_{33}(76.8214 - \bar{X}_3)^2}$$
$$\overline{+\, 2c_{12}(80.9643 - \bar{X}_1)(66.5357 - \bar{X}_2) + 2c_{13}(80.9643 - \bar{X}_1)(76.8214 - \bar{X}_3)}$$
$$\overline{+\, 2c_{23}(66.5357 - \bar{X}_2)(76.8214 - \bar{X}_3)\Big]}$$

$$= 81.576 \pm 2.064\sqrt{19.8240\,[0.055663]}$$

= 79.41 to 83.74

Note that each cross-product term such as $c_{13}(X_1 - \bar{X}_1)(X_3 - \bar{X}_3)$ is
multiplied by 2. This results because in summing over both i and j we get the
terms $c_{13}(X_1 - \bar{X}_1)(X_3 - \bar{X}_3)$ and $c_{31}(X_3 - \bar{X}_3)(X_1 - \bar{X}_1)$. As
previously noted, the matrix of c-multipliers is symmetric, so that
$c_{13} = c_{31}$; hence we can combine these two terms to get
$2c_{13}(X_1 - \bar{X}_1)(X_3 - \bar{X}_3)$.

For the confidence limits on a single predicted value of Y (as opposed to
mean Y) associated with a specified set of X values the equation is

$$\hat{Y} \pm t\sqrt{(\text{Residual MS})\left(1 + \frac{1}{n} + \sum_i \sum_j c_{ij}(X_i - \bar{X}_i)(X_j - \bar{X}_j)\right)}$$

With the above set of X values these limits would be
72.13 to 91.02
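Both sets of limits can be computed with one short function. In this sketch the function name `regression_limits` and its `single` flag are illustrative assumptions; the c-multipliers, means, residual mean square, and the tabular t of 2.064 (24 df) are the values given in the text.

```python
import math

# c-multipliers (inverse of the matrix of corrected sums), from the text.
C = [
    [0.000237573, 0.000176649, -0.000436615],
    [0.000176649, 0.000340994, -0.000468616],
    [-0.000436615, -0.000468616, 0.001285000],
]


def regression_limits(y_hat, x, mean_x, c, resid_ms, n, t, single=False):
    """Confidence limits on mean Y, or on a single Y when single=True."""
    dev = [xi - mi for xi, mi in zip(x, mean_x)]
    k = len(dev)
    # Double sum over i and j; cross-products are counted twice automatically.
    quad = sum(c[i][j] * dev[i] * dev[j] for i in range(k) for j in range(k))
    inside = 1.0 / n + quad + (1.0 if single else 0.0)
    half_width = t * math.sqrt(resid_ms * inside)
    return y_hat - half_width, y_hat + half_width


args = dict(y_hat=81.576, x=[80.9643, 66.5357, 76.8214],
            mean_x=[70.9643, 71.5357, 77.8214], c=C,
            resid_ms=19.8240, n=28, t=2.064)

mean_low, mean_high = regression_limits(**args)
single_low, single_high = regression_limits(single=True, **args)
print(f"mean Y:   {mean_low:.2f} to {mean_high:.2f}")
print(f"single Y: {single_low:.2f} to {single_high:.2f}")
```

The limits for a single Y are several times wider than those for the mean because the residual variation of an individual observation is added to the uncertainty of the fitted regression.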


The c-multipliers may also be used to calculate the estimated regression
coefficients. The general equation is

$$b_j = \sum_i c_{ji}\left(\sum x_iy\right)$$

where: $\sum x_iy$ = the right-hand side of the ith normal equation
(i = 1, ..., k).

To illustrate, $b_2$ in the previous example would be calculated as:

$$b_2 = c_{21}(\sum x_1y) + c_{22}(\sum x_2y) + c_{23}(\sum x_3y)$$
$$= 0.000{,}176{,}649(6{,}428.7858) + 0.000{,}340{,}994(2{,}632.2143) - 0.000{,}468{,}616(3{,}327.9286)$$
$$= 0.47369 \text{ as before (except for rounding errors)}$$


The regression coefficients are sample estimates and are, of course,
subject to sampling variation. This means that any regression coefficient
has a standard error and any pair of coefficients will have a covariance.
The standard error of a regression coefficient is estimated by

$$\text{Standard error of } b_j = \sqrt{c_{jj}(\text{Residual MS})}$$

Hence,

$$\text{Variance of } b_j = c_{jj}(\text{Residual MS})$$

The covariance of any two coefficients is

$$\text{Covariance of } b_i \text{ and } b_j = c_{ij}(\text{Residual MS})$$


The variance and covariance equations permit testing various hypotheses
about the regression coefficients by means of a t test in the general
form

$$t = \frac{\hat{\theta} - \theta}{\sqrt{\text{Variance of } \hat{\theta}}}$$

where: $\hat{\theta}$ = any linear function of the estimated coefficients
$\theta$ = hypothesized value of the function.

In the discussion of the analysis of variance we tested the contribution
of $X_2$ in the presence of $X_1$ and $X_3$ (obtaining F = 33.19**). This is
actually equivalent to testing the hypothesis that in the model
$Y = a + b_1X_1 + b_2X_2 + b_3X_3$ the true value of $b_2$ is zero. The t test of
this hypothesis is as follows:

$$t = \frac{b_2}{\sqrt{\text{Variance of } b_2}} = \frac{b_2}{\sqrt{c_{22}(\text{Residual MS})}} = \frac{0.47368}{\sqrt{0.000{,}340{,}994(19.8240)}}$$

= 5.761, with 24 df = df for Residual MS

Note that $t^2 = 33.19$ = the F value obtained in testing this same hypothesis.
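The general t test of a linear function of the coefficients can be sketched as below. The function name `linear_function_t` and the idea of passing the function as a list of weights (one per coefficient) are assumptions for illustration; the variance is built from the variance and covariance rules just given.

```python
import math

# c-multipliers and coefficients from the worked example.
C = [
    [0.000237573, 0.000176649, -0.000436615],
    [0.000176649, 0.000340994, -0.000468616],
    [-0.000436615, -0.000468616, 0.001285000],
]
B = [0.53926, 0.47368, 0.23598]
RESIDUAL_MS = 19.8240


def linear_function_t(weights, coefs, c, resid_ms, hypothesized=0.0):
    """t for the hypothesis that sum(weights[i] * b[i]) equals hypothesized."""
    k = len(coefs)
    estimate = sum(w * b for w, b in zip(weights, coefs))
    # Variance of a linear function: sum over i, j of w_i * w_j * c_ij * MS.
    variance = resid_ms * sum(weights[i] * weights[j] * c[i][j]
                              for i in range(k) for j in range(k))
    return (estimate - hypothesized) / math.sqrt(variance)


# Hypothesis that b2 = 0: weights pick out the second coefficient only.
t_b2 = linear_function_t([0, 1, 0], B, C, RESIDUAL_MS)
print(f"t = {t_b2:.3f}, t squared = {t_b2 ** 2:.2f}")
```

Other hypotheses only need different weights; for instance `[1, 0, -2]` would test whether $b_1 - 2b_3$ differs from zero.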
It is also possible to test the hypothesis that a coefficient has some value
other than zero or that a linear function of the coefficients has some
specified value. For example, if there were some reason for believing that
$b_1 = 2b_3$ or $b_1 - 2b_3 = 0$, this hypothesis could be tested by

$$t = \frac{b_1 - 2b_3}{\sqrt{\text{Variance of } (b_1 - 2b_3)}}$$

Referring to the section on the variance of a linear function, we find that:

Variance of $(b_1 - 2b_3)$ = Variance of $b_1$ + 4(variance of $b_3$)
- 4(covariance of $b_1$ and $b_3$)

$= (c_{11} + 4c_{33} - 4c_{13})(\text{Residual MS})$

$= 19.8240[(0.000{,}237{,}573) + 4(0.001{,}285{,}000) - 4(-0.000{,}436{,}615)]$

$= 0.141{,}226{,}830$

Then,

$$t_{24\,df} = \frac{0.53926 - 2(0.23598)}{\sqrt{0.141{,}226{,}830}} = 0.179$$

The hypothesis would not be rejected at the 0.05 level.

The assumptions underlying these methods of fitting a
multiple regression are the same as those for a simple linear regression:
equal variance of Y at all combinations of X values and independence of
the errors (i.e., departures from regression) of the sample observations.
Application of the F or t distributions (for testing or setting confidence
limits) requires the further assumption of normality.

The reader is again warned against inferring more than is actually
implied by a regression equation and analysis. For one thing, no matter
how well a particular equation may fit a set of data, it is only a
mathematical approximation of the relationship between a dependent and a
set of independent variables. It should not be construed as representing a
biological or physical law. Nor does it prove the existence of a cause and
effect relationship. It is merely a convenient way of describing an
observed association.

Tests of significance must also be interpreted with caution. A
significant F or t test means that the estimated values of the regression
coefficients differ from the hypothesized values (usually zero) by more than
would be expected by chance. Even though a regression is highly
significant, the predicted values may not be very close to the actual (look
at the standard errors). Conversely, the fact that a particular variable
(say $X_j$) is not significantly related to Y does not necessarily mean that
a relationship is lacking. Perhaps the test was insensitive or we did not
select the proper model to represent the relationship.

Regression analysis is a very useful technique, but it does not relieve the 
research worker of the responsibility for thinking. 











Curvilinear Regressions and Interactions 


Curves.—Many forms of curvilinear relationships can be fitted by the 
regression methods that have been described in the previous sections. 


If the relationship between height and age is assumed to be hyperbolic
so that

$$\text{Height} = a + \frac{b}{\text{Age}}$$

then we could let Y = Height and $X_1$ = 1/Age and fit

$$Y = a + bX_1$$

Similarly, if the relationship between Y and X is quadratic

$$Y = a + bX + cX^2$$

we can let $X = X_1$ and $X^2 = X_2$ and fit

$$Y = a + b_1X_1 + b_2X_2$$

Functions such as

$$Y = aX^b$$
$$Y = a(b^X)$$
$$10^Y = aX^b$$

which are nonlinear in the coefficients can sometimes be made linear by a
logarithmic transformation. The equation

$$Y = aX^b$$

would become

$$\log Y = \log a + b(\log X)$$

which could be fitted by

$$Y' = a' + bX_1$$

where $Y' = \log Y$, and
$X_1 = \log X$.
The second equation transforms to

$$\log Y = \log a + (\log b)X$$

The third becomes

$$Y = \log a + b(\log X)$$

Both can be fitted by the linear model.

In making these transformations the effect on the assumption of
homogeneous variance must be considered. If Y has homogeneous variance,
log Y probably will not have, and vice versa.
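As an illustration of the logarithmic transformation, the sketch below fits $Y = aX^b$ by a simple linear regression of log Y on log X. The function name `fit_power_curve` is an assumption, and the data are made up (generated exactly from a = 2 and b = 1.5) purely to show that the transformation recovers the coefficients; they do not come from the text.

```python
import math


def fit_power_curve(xs, ys):
    """Fit Y = a * X**b by least squares on log10(Y) against log10(X)."""
    log_x = [math.log10(x) for x in xs]
    log_y = [math.log10(y) for y in ys]
    n = len(xs)
    mean_lx = sum(log_x) / n
    mean_ly = sum(log_y) / n
    # Corrected sums of products and squares in the transformed scale.
    sum_xy = sum((u - mean_lx) * (v - mean_ly) for u, v in zip(log_x, log_y))
    sum_x2 = sum((u - mean_lx) ** 2 for u in log_x)
    b = sum_xy / sum_x2
    log_a = mean_ly - b * mean_lx      # intercept a' = log a
    return 10 ** log_a, b


# Hypothetical, error-free data following Y = 2 * X**1.5.
xs = [1.0, 2.0, 4.0, 8.0, 16.0]
ys = [2.0 * x ** 1.5 for x in xs]

a, b = fit_power_curve(xs, ys)
print(f"a = {a:.4f}, b = {b:.4f}")
```

With real data the fit minimizes squared deviations of log Y, not of Y, which is the point of the warning about homogeneous variance.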

Some curvilinear models cannot be fitted by the methods that have
been described. Some examples are

$$Y = a + b^X$$
$$Y = a(X - b)^2$$
$$Y = a(X_1 - b)(X_2 - c)$$


Fitting these models requires more cumbersome procedures. 

Interactions.—Suppose that there is a simple linear relationship between
Y and $X_1$. If the slope (b) of this relationship varies, depending on the
level of some other independent variable ($X_2$), then $X_1$ and $X_2$ are said
to interact. Such interactions can sometimes be handled by introducing
interaction variables.

To illustrate, suppose that we know that there is a linear relationship
between Y and $X_1$.

$$Y = a + bX_1$$

Suppose further that we know or suspect that the slope (b) varies linearly
with Z

$$b = a' + b'Z$$

This implies the relationship

$$Y = a + (a' + b'Z)X_1$$

or

$$Y = a + a'X_1 + b'X_1Z$$

which can be fitted by

$$Y = a + b_1X_1 + b_2X_2$$

where $X_2 = X_1Z$, an interaction variable.
If the Y-intercept is also a linear function of Z, then

$$a = a'' + b''Z$$

and the form of relationship is

$$Y = a'' + b''Z + a'X_1 + b'X_1Z$$

which could be fitted by

$$Y = a + b_1X_1 + b_2X_2 + b_3X_3$$

where $X_2 = Z$, and
$X_3 = X_1Z$.


Group Regressions 


Linear regressions of Y on X were fitted for each of two groups. The
basic data and fitted regressions were:

Group A                                                        Sum    Mean
Y    3   7   9   6   8  13  10  12  14                          82   9.111
X    1   4   7   7   2   9  10   6  12                          58   6.444

n = 9, $\sum Y^2 = 848$, $\sum XY = 609$, $\sum X^2 = 480$
$\sum y^2 = 100.8889$, $\sum xy = 80.5556$, $\sum x^2 = 106.2222$
$\hat{Y} = 4.224 + 0.7584X$
Residual SS = 39.7980, with 7 df.

Group B                                                        Sum    Mean
Y    4   6  12   2   8   7   0   5   9   2  11   3  10          79   6.077
X    4   9  14   6   9  12   2   7   5   5  11   2  13          99   7.615

n = 13, $\sum Y^2 = 653$, $\sum XY = 753$, $\sum X^2 = 951$
$\sum y^2 = 172.9231$, $\sum xy = 151.3846$, $\sum x^2 = 197.0769$
$\hat{Y} = 0.228 + 0.7681X$
Residual SS = 56.6370, with 11 df.




Now we might ask, are these really different regressions? Or could the
data be combined to produce a single regression that would be applicable
to both groups? If there is no significant difference between the residual
mean squares for the two groups (this matter may be determined by
Bartlett's test, page 22), the test described below helps to answer the
question.

Testing for the common regressions.—Simple linear regressions may differ
either in their slope or in their level. In testing for common regressions
the procedure is to test first for common slopes. If the slopes differ
significantly, the regressions are different and no further testing is needed.
If the slopes are not significantly different, the difference in level is tested.
The analysis table is:

                                                                        Residuals
Line  Group                 df     Σy²        Σxy        Σx²        df      SS         MS
 1    A                      8   100.8889    80.5556   106.2222      7    39.7980
 2    B                     12   172.9231   151.3846   197.0769     11    56.6370
 3    Pooled residuals                                              18    96.4350    5.3575
 4    Difference for testing common slopes                           1     0.0067    0.0067
 5    Common slope          20   273.8120   231.9402   303.2991     19    96.4417    5.0759
 6    Difference for testing levels                                  1    80.1954   80.1954
 7    Single regression     21   322.7727   213.0455   310.5909     20   176.6371







The first two lines in this table contain the basic data for the two
groups. To the left are the total df for the groups (8 for A and 12 for B).
In the center are the corrected sums of squares and products. The right
side of the table gives the residual sum of squares and df. Since only
simple linear regressions have been fitted, the residual df for each group
are one less than the total df. The residual sum of squares is obtained by
first computing the reduction sum of squares for each group.

$$\text{Reduction SS} = \frac{(\sum xy)^2}{\sum x^2}$$

This reduction is then subtracted from the total sum of squares ($\sum y^2$) to
give the residuals.

Line 3 is obtained by pooling the residual df and residual sums of
squares for the groups. Dividing the pooled sum of squares by the pooled
df gives the pooled mean square.

The left side and center of line 5 (we will skip line 4 for the moment) are
obtained by pooling the total df and the corrected sums of squares and
products for the groups. These are the values that are obtained under
the assumption of no difference in the slopes of the group regressions. If
the assumption is wrong, the residuals about this common slope regression
will be considerably larger than the mean square residual about the
separate regressions. The residual df and sum of squares are obtained by
fitting a straight line to these pooled data. The residual df are, of course,
one less than the total df. The residual sum of squares is, as usual,

$$\text{Residual SS} = 273.8120 - \frac{(231.9402)^2}{303.2991} = 96.4417$$





Now, the difference between these residuals (line 4 = line 5 - line 3)
provides a test of the hypothesis of common slopes. The error term for
this test is the pooled mean square from line 3.

$$\text{Test of common slopes: } F_{1/18\,df} = \frac{0.0067}{5.3575} = 0.0013$$

The difference is not significant.

If the slopes differed significantly, the groups would have different
regressions, and we would stop here. Since the slopes did not differ, we
now go on to test for a difference in the levels of the regression.

Line 7 is what we would have if we ignored the groups entirely, lumped
all the original observations together, and fitted a single linear regression.
The combined data are as follows:

n = (9 + 13) = 22 (so the df for total = 21)

$\sum Y = (82 + 79) = 161$, $\sum Y^2 = (848 + 653) = 1{,}501$

$$\sum y^2 = 1{,}501 - \frac{(161)^2}{22} = 322.7727$$

$\sum X = (58 + 99) = 157$, $\sum X^2 = (480 + 951) = 1{,}431$

$$\sum x^2 = 1{,}431 - \frac{(157)^2}{22} = 310.5909$$

$\sum XY = (609 + 753) = 1{,}362$,

$$\sum xy = 1{,}362 - \frac{(157)(161)}{22} = 213.0455$$

From this we obtain the residual values on the right side of line 7.

$$\text{Residual SS} = 322.7727 - \frac{(213.0455)^2}{310.5909} = 176.6371$$

If there is a real difference among the levels of the groups, the residuals
about this single regression will be considerably larger than the mean
square residual about the regression that assumed the same slopes but
different levels. This difference (line 6 = line 7 - line 5) is tested against
the residual mean square from line 5.

$$\text{Test of levels: } F_{1/19\,df} = \frac{80.1954}{5.0759} = 15.80^{**}$$

As the levels differ significantly, the groups do not have the same
regressions.

The test is easily extended to cover several groups, though there may
be a problem in finding which groups are likely to have separate
regressions and which can be combined. The test can also be extended to
multiple regressions.
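The whole analysis table can be generated from the uncorrected sums for each group. The sketch below is one way to organize the arithmetic; the function and dictionary key names are assumptions for illustration, and the group totals are those given in the text. The code works for any number of groups, with (number of groups - 1) df for each difference.

```python
def corrected_sums(g):
    """Corrected sums (sum y^2, sum xy, sum x^2) from uncorrected totals."""
    n = g["n"]
    sy2 = g["sum_y2"] - g["sum_y"] ** 2 / n
    sxy = g["sum_xy"] - g["sum_x"] * g["sum_y"] / n
    sx2 = g["sum_x2"] - g["sum_x"] ** 2 / n
    return sy2, sxy, sx2


def residual_ss(sy2, sxy, sx2):
    """Residual SS about a fitted straight line."""
    return sy2 - sxy ** 2 / sx2


def common_regression_tests(groups):
    """Return (F for common slopes, F for common levels)."""
    k = len(groups)
    parts = [corrected_sums(g) for g in groups]
    # Line 3: pooled residuals about the separate regressions.
    pooled_ss = sum(residual_ss(*p) for p in parts)
    pooled_df = sum(g["n"] - 2 for g in groups)
    # Line 5: residuals about a common slope with separate levels.
    common_ss = residual_ss(*[sum(p[i] for p in parts) for i in range(3)])
    common_df = sum(g["n"] - 1 for g in groups) - 1
    # Line 7: residuals about a single regression ignoring groups.
    combined = {key: sum(g[key] for g in groups) for key in groups[0]}
    single_ss = residual_ss(*corrected_sums(combined))
    f_slopes = ((common_ss - pooled_ss) / (k - 1)) / (pooled_ss / pooled_df)
    f_levels = ((single_ss - common_ss) / (k - 1)) / (common_ss / common_df)
    return f_slopes, f_levels


group_a = dict(n=9, sum_y=82, sum_x=58, sum_y2=848, sum_xy=609, sum_x2=480)
group_b = dict(n=13, sum_y=79, sum_x=99, sum_y2=653, sum_xy=753, sum_x2=951)

f_slopes, f_levels = common_regression_tests([group_a, group_b])
print(f"F for common slopes = {f_slopes:.4f}")
print(f"F for common levels = {f_levels:.2f}")
```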


Analysis of Covariance in a Randomized Block Design 


A test was made of the effect of three soil treatments on the height 
growth of 2-year-old seedlings. Treatments were assigned at random to 
the three plots within each of 11 blocks. Each plot was made up of 50 
seedlings. Average 5-year height growth was the criterion for evaluating 
treatments. Initial heights and 5-year growths, all in feet, were: 


            Treatment A        Treatment B        Treatment C        Block totals
Block     Height   Growth    Height   Growth    Height   Growth    Height   Growth
  1         3.6      8.9       3.1     10.7       4.7     12.4      11.4     32.0
  2         4.7     10.1       4.9     14.2       2.6      9.0      12.2     33.3
  3         2.6      6.3        .8      5.9       1.5      7.4       4.9     19.6
  4         5.3     14.0       4.6     12.6       4.3     10.1      14.2     36.7
  5         3.1      9.6       3.9     12.5       3.3      6.8      10.3     28.9
  6         1.8      6.4       1.7      9.6       3.6     10.0       7.1     26.0
  7         5.8     12.3       5.5     12.8       5.8     11.9      17.1     37.0
  8         3.8     10.8       2.6      8.0       2.0      7.5       8.4     26.3
  9         2.4      8.0       1.1      7.5       1.6      5.2       5.1     20.7
 10         5.3     12.6       4.4     11.4       5.8     13.4      15.5     37.4
 11         3.6      7.4       1.4      8.4       4.8     10.7       9.8     26.5
Sums       42.0    106.4      34.0    113.6      40.0    104.4     116.0    324.4
Means      3.82     9.67      3.09    10.33      3.64     9.49      3.52     9.83

The analysis of variance of growth is:

Source         df       SS        MS
Blocks         10     132.83
Treatment       2       4.26     2.130
Error          20      68.88     3.444
Total          32     205.97

$$F \text{ (for testing treatments)}_{2/20\,df} = \frac{2.130}{3.444} = 0.618$$

Not significant at 0.05 level.


There is no evidence of a real difference in growth due to treatments.
There is, however, reason to believe that, for young seedlings, growth is
affected by initial height. A glance at the block totals seems to suggest
that plots with greatest initial height had greatest 5-year growth. The
possibility that effects of treatment are being obscured by differences in
initial heights raises the question of how the treatments would compare if
adjusted for differences in initial heights.

If the relationship between height growth and initial height is linear
and if the slope of the regression is the same for all treatments, the test
of adjusted treatment means can be made by an analysis of covariance as
described below. In this analysis, the growth will be labeled Y and initial
height X.

Computationally the first step is to obtain total, block, treatment, and
error sums of squares of X ($SS_x$) and sums of products of X and Y
($SP_{xy}$), just as has already been done for Y.


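For X:

$$C.T._x = \frac{(116.0)^2}{33} = 407.76$$

$$\text{Total } SS_x = (3.6^2 + \dots + 4.8^2) - C.T._x = 73.26$$

$$\text{Block } SS_x = \frac{11.4^2 + \dots + 9.8^2}{3} - C.T._x = 54.31$$

$$\text{Treatment } SS_x = \frac{42.0^2 + 34.0^2 + 40.0^2}{11} - C.T._x = 3.15$$

$$\text{Error } SS_x = \text{Total} - \text{Block} - \text{Treatment} = 15.80$$

For XY:

$$C.T._{xy} = \frac{(116.0)(324.4)}{33} = 1{,}140.32$$

$$\text{Total } SP_{xy} = (3.6)(8.9) + \dots + (4.8)(10.7) - C.T._{xy} = 103.99$$

$$\text{Block } SP_{xy} = \frac{(11.4)(32.0) + \dots + (9.8)(26.5)}{3} - C.T._{xy} = 82.71$$

$$\text{Treatment } SP_{xy} = \frac{(42.0)(106.4) + (34.0)(113.6) + (40.0)(104.4)}{11} - C.T._{xy} = -3.30$$

$$\text{Error } SP_{xy} = \text{Total} - \text{Block} - \text{Treatment} = 24.58$$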
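
The sums of squares of X and the sums of products of X and Y follow one
pattern, since a sum of squares is simply the sum of products of a variable
with itself. The sketch below uses a single hypothetical helper,
`partition`, for both. The data layout is an illustrative assumption, and
the two growth values of 10.1 are readings inferred from the printed
totals.

```python
# Initial height (X) and 5-year growth (Y) in feet, blocks 1 through 11.
HEIGHT = {
    "A": [3.6, 4.7, 2.6, 5.3, 3.1, 1.8, 5.8, 3.8, 2.4, 5.3, 3.6],
    "B": [3.1, 4.9, 0.8, 4.6, 3.9, 1.7, 5.5, 2.6, 1.1, 4.4, 1.4],
    "C": [4.7, 2.6, 1.5, 4.3, 3.3, 3.6, 5.8, 2.0, 1.6, 5.8, 4.8],
}
GROWTH = {
    "A": [8.9, 10.1, 6.3, 14.0, 9.6, 6.4, 12.3, 10.8, 8.0, 12.6, 7.4],
    "B": [10.7, 14.2, 5.9, 12.6, 12.5, 9.6, 12.8, 8.0, 7.5, 11.4, 8.4],
    "C": [12.4, 9.0, 7.4, 10.1, 6.8, 10.0, 11.9, 7.5, 5.2, 13.4, 10.7],
}


def partition(u, v):
    """Split the sum of products of u and v into total, block, treatment,
    and error parts. Passing the same variable twice gives sums of squares."""
    names = list(u)
    t = len(names)                      # number of treatments
    b = len(u[names[0]])                # number of blocks
    sum_u = sum(sum(u[n]) for n in names)
    sum_v = sum(sum(v[n]) for n in names)
    correction = sum_u * sum_v / (t * b)
    total = sum(a * c for n in names for a, c in zip(u[n], v[n])) - correction
    block = sum(
        sum(u[n][j] for n in names) * sum(v[n][j] for n in names)
        for j in range(b)
    ) / t - correction
    treatment = sum(sum(u[n]) * sum(v[n]) for n in names) / b - correction
    return {
        "total": total,
        "block": block,
        "treatment": treatment,
        "error": total - block - treatment,
    }


ss_x = partition(HEIGHT, HEIGHT)     # sums of squares of X
sp_xy = partition(HEIGHT, GROWTH)    # sums of products of X and Y
print({k: round(val, 2) for k, val in ss_x.items()})
print({k: round(val, 2) for k, val in sp_xy.items()})
```

Because the script carries full precision, the last digit of a term may
differ by one from the hand computation, which rounds at each step.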
These computed terms are arranged in a manner similar to that for the
test of group regressions (which is exactly what the covariance analysis
is). One departure is that the total line is put at the top.

                                                   Residuals
    Source        df     SS_y     SP_xy    SS_x    df     SS       MS
    Total         32    205.97   103.99   73.26
    Blocks        10    132.83    82.71   54.31
    Treatment      2      4.26    -3.30    3.15
    Error         20     68.88    24.58   15.80    19   30.641   1.613
On the error line, the residual sum of squares after adjusting for a linear
regression is

$$\text{Residual } SS = SS_y - \frac{(SP_{xy})^2}{SS_x} = 68.88 - \frac{(24.58)^2}{15.80} = 30.641$$

This sum of squares has 1 df less than the unadjusted sum of squares.

To test treatments we first pool the unadjusted df and sums of squares
and products for treatment and error. The residual terms for this pooled
line are then computed just as they were for the error line.





                                                     Residuals
                        df    SS_y    SP_xy    SS_x    df     SS
    Treatment + error   22   73.14    21.28   18.95    21   49.244


Then to test for a difference among treatments after adjustment for the
regression of growth on initial height, we compute the difference in
residuals between the error and the treatment + error lines:

                                                  df     SS       MS
    Difference for testing adjusted treatments     2   18.603   9.302


The mean square for the difference in residuals is now tested against
the residual mean square for error.

$$F_{2/19\ df} = \frac{9.302}{1.613} = 5.767$$
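
The adjusted test can be retraced from the terms on the error and
treatment + error lines. The following is a minimal sketch that starts
from the rounded figures printed above; the function and variable names
are illustrative only.

```python
def residual_ss(ss_y, sp_xy, ss_x):
    """Residual sum of squares of Y after fitting a linear regression on X."""
    return ss_y - sp_xy ** 2 / ss_x


# Unadjusted terms (SS_y, SP_xy, SS_x) and df, as printed in the text.
error = {"df": 20, "ss_y": 68.88, "sp_xy": 24.58, "ss_x": 15.80}
pooled = {"df": 22, "ss_y": 73.14, "sp_xy": 21.28, "ss_x": 18.95}

# Each residual line loses 1 df for the fitted slope.
error_resid = residual_ss(error["ss_y"], error["sp_xy"], error["ss_x"])
pooled_resid = residual_ss(pooled["ss_y"], pooled["sp_xy"], pooled["ss_x"])
error_resid_df = error["df"] - 1
pooled_resid_df = pooled["df"] - 1

# Difference in residuals tests the adjusted treatment means.
diff_ss = pooled_resid - error_resid
diff_df = pooled_resid_df - error_resid_df
f_adjusted = (diff_ss / diff_df) / (error_resid / error_resid_df)

print(f"Error residual SS:  {error_resid:.3f} on {error_resid_df} df")
print(f"Pooled residual SS: {pooled_resid:.3f} on {pooled_resid_df} df")
print(f"F = {f_adjusted:.3f} with {diff_df}/{error_resid_df} df")
```

The computed F is then compared with the tabular F for 2 and 19 degrees
of freedom, exactly as in the hand calculation.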


Thus, after adjustment, the difference in treatment means is found to be 
significant at the 0.05 level. It may also happen that differences that 
were significant before adjustment are not significant afterwards. 

If the independent variable has been affected by treatments,
interpretation of a covariance analysis requires careful thinking. The
covariance adjustment may have the effect of removing the treatment
differences that are being tested. On the other hand, it may be
informative to know that treatments are or are not significantly different
in spite of the covariance adjustment. The beginner who is uncertain of
the interpretations would do well to select as covariates only those that
have not been affected by treatments.

The covariance test may be made in a similar manner for any experi- 
mental design and, if desired (and justified), adjustment may be made for 
multiple or curvilinear regressions. 

The entire analysis is usually presented in the following form:

                                                        Adjusted
    Source              df     SS_y     SP_xy    SS_x    df     SS       MS
    Total               32    205.97   103.99   73.26
    Blocks              10    132.83    82.71   54.31
    Treatment            2      4.26    -3.30    3.15
    Error               20     68.88    24.58   15.80    19   30.641   1.613
    Treatment + error   22     73.14    21.28   18.95    21   49.244     --
    Difference for testing adjusted treatment means       2   18.603   9.302

Unadjusted treatments: $F_{2/20\ df} = \frac{2.130}{3.444} = 0.618$. Not significant.

Adjusted treatments: $F_{2/19\ df} = \frac{9.302}{1.613} = 5.767$. Significant at the 0.05 level.


Adjusted means.—If we wish to know what the treatment means are
after adjustment for regression, the equation is

$$\text{Adjusted } \bar{Y}_i = \bar{Y}_i - b(\bar{X}_i - \bar{X})$$

where: $\bar{Y}_i$ = Unadjusted mean for treatment i.

       $b$ = Coefficient of the linear regression
           $= \dfrac{\text{Error } SP_{xy}}{\text{Error } SS_x}$

       $\bar{X}_i$ = Mean of the independent variable for treatment i.

       $\bar{X}$ = Mean X for all treatments.

In the example we had $\bar{X}_A = 3.82$, $\bar{X}_B = 3.09$,
$\bar{X}_C = 3.64$, $\bar{X} = 3.52$, and

$$b = \frac{24.58}{15.80} = 1.56$$


So, the unadjusted and adjusted mean growths are

                     Mean growths
    Treatment    Unadjusted    Adjusted
        A           9.67          9.20
        B          10.33         11.00
        C           9.49          9.30
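
The adjustment of each treatment mean is a one-line computation. The
sketch below applies the equation for adjusted means to the figures
printed in the text; the names are illustrative, and b is carried at full
precision rather than rounded to 1.56, so results may differ slightly in
the third decimal.

```python
# Error-line terms from the covariance table.
ERROR_SP_XY = 24.58
ERROR_SS_X = 15.80

# Treatment means as (mean initial height X, mean growth Y), in feet.
MEANS = {"A": (3.82, 9.67), "B": (3.09, 10.33), "C": (3.64, 9.49)}
GRAND_MEAN_X = 3.52


def adjusted_means(means, grand_mean_x, slope):
    """Adjust each treatment mean of Y to the overall mean of X."""
    return {
        name: mean_y - slope * (mean_x - grand_mean_x)
        for name, (mean_x, mean_y) in means.items()
    }


b = ERROR_SP_XY / ERROR_SS_X          # error regression coefficient
adjusted = adjusted_means(MEANS, GRAND_MEAN_X, b)
for name, value in adjusted.items():
    print(f"Treatment {name}: unadjusted {MEANS[name][1]:.2f}, "
          f"adjusted {value:.2f}")
```

Treatment B, which started with the shortest seedlings, is adjusted
upward; A and C, which started taller than average, are adjusted downward.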


Tests among adjusted means.—In an earlier section we encountered
methods of making further tests among the means. Ignoring the covariance
adjustment, we could for example make an F test for pre-specified
comparisons such as A+C vs. B, or A vs. C. Similar tests can also be
made after adjustment for covariance, though they involve more labor.
The F test will be illustrated for the comparison B vs. A+C after
adjustment.

As might be suspected, to make the F test we must first compute sums
of squares and products of X and Y for the specified comparison:

$$SS_y = \frac{[2(\Sigma Y_B) - (\Sigma Y_A + \Sigma Y_C)]^2}{11(2^2 + 1^2 + 1^2)} = \frac{[2(113.6) - (106.4 + 104.4)]^2}{66} = 4.08$$

$$SS_x = \frac{[2(\Sigma X_B) - (\Sigma X_A + \Sigma X_C)]^2}{11(2^2 + 1^2 + 1^2)} = \frac{[2(34.0) - (42.0 + 40.0)]^2}{66} = 2.97$$

$$SP_{xy} = \frac{[2(\Sigma Y_B) - (\Sigma Y_A + \Sigma Y_C)][2(\Sigma X_B) - (\Sigma X_A + \Sigma X_C)]}{11(2^2 + 1^2 + 1^2)} = -3.48$$

From this point on, the F test of B vs. A+C is made in exactly the same
manner as the test of treatments in the covariance analysis.





                                                  Residuals
    Source         df    SS_y     SP_xy    SS_x    df     SS       MS
    B vs. A+C       1    4.08    -3.48     2.97    --     --       --
    Error          20   68.88    24.58    15.80    19   30.641    1.613
    Sum            21   72.96    21.10    18.77    20   49.241     --
    Difference for testing adjusted comparison      1   18.600   18.600

$$F_{1/19\ df} = \frac{18.600}{1.613} = 11.531$$

Significant at the 0.01 level.
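
The adjusted test of the comparison B vs. A+C can be sketched the same
way. The coefficients (2 for B, -1 for A and C), the 11 plots per
treatment, and the treatment totals come from the example; the function
names are illustrative. Because the script does not round intermediate
terms, its F differs from the hand value in the second decimal.

```python
# Treatment totals (sum of X, sum of Y) over the 11 blocks.
TOTALS = {"A": (42.0, 106.4), "B": (34.0, 113.6), "C": (40.0, 104.4)}
COEFFS = {"A": -1, "B": 2, "C": -1}      # comparison: B vs. A + C
PLOTS_PER_TREATMENT = 11

# Error-line terms from the covariance table.
ERROR = {"df": 20, "ss_y": 68.88, "sp_xy": 24.58, "ss_x": 15.80}


def residual_ss(ss_y, sp_xy, ss_x):
    """Residual sum of squares of Y after a linear regression on X."""
    return ss_y - sp_xy ** 2 / ss_x


# Comparison totals for X and Y, and the common divisor n * sum(c^2).
q_x = sum(COEFFS[k] * TOTALS[k][0] for k in TOTALS)
q_y = sum(COEFFS[k] * TOTALS[k][1] for k in TOTALS)
divisor = PLOTS_PER_TREATMENT * sum(c ** 2 for c in COEFFS.values())

ss_y = q_y ** 2 / divisor
ss_x = q_x ** 2 / divisor
sp_xy = q_x * q_y / divisor

# Pool the comparison with error, then take the difference in residuals.
error_resid = residual_ss(ERROR["ss_y"], ERROR["sp_xy"], ERROR["ss_x"])
pooled_resid = residual_ss(
    ERROR["ss_y"] + ss_y, ERROR["sp_xy"] + sp_xy, ERROR["ss_x"] + ss_x
)
error_ms = error_resid / (ERROR["df"] - 1)
f_comparison = (pooled_resid - error_resid) / error_ms   # 1 df in numerator

print(f"SS_y = {ss_y:.2f}, SS_x = {ss_x:.2f}, SP_xy = {sp_xy:.2f}")
print(f"F = {f_comparison:.2f} with 1/{ERROR['df'] - 1} df")
```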




REFERENCES FOR FURTHER READING 


Arkin, H., and Colton, R. R.
    1963. Tables for statisticians. Ed. 2, 168 pp., illus. New York:
    Barnes and Noble.
Cochran, W. G.
    1963. Sampling techniques. Ed. 2, 413 pp., illus. New York:
    Wiley.
Cochran, W. G., and Cox, G. M.
    1957. Experimental designs. Ed. 2, 611 pp., illus. New York:
    Wiley.
Fisher, R. A.
    1954. Statistical methods for research workers. Ed. 12, 356 pp.,
    illus. New York: Hafner.
Freese, F.
    1962. Elementary forest sampling. U.S. Dept. Agr., Agr. Handb.
    232, 91 pp.
Freese, F.
    1964. Linear regression methods for forest research. U.S. Forest
    Serv. Res. Paper FPL-17, 136 pp., illus. Forest Products
    Laboratory, Madison, Wis.
Goulden, C. H.
    1952. Methods of statistical analysis. Ed. 2, 467 pp., illus. New
    York: Wiley.
Natrella, M. G.
    1963. Experimental statistics. U.S. Natl. Bureau Standards Handb.
    91, 522 pp., illus.
Schumacher, F. X., and Chapman, R. A.
    1954. Sampling methods in forestry and range management. Duke
    Univ. School Forestry Bul. 7. Ed. 3, 222 pp., illus. Durham,
    N.C.
Snedecor, G. W.
    1956. Statistical methods. Ed. 5, 534 pp., illus. Ames, Iowa:
    Iowa State Univ. Press.
Steel, R. G. D., and Torrie, J. H.
    1960. Principles and procedures of statistics. 481 pp., illus. New
    York: McGraw-Hill.
Walker, H. M.
    1951. Mathematics essential for elementary statistics. Rev. ed., 382
    pp., illus. New York: Holt.
Wallis, W. A., and Roberts, H. V.
    1956. Statistics: a new approach. 646 pp., illus. Glencoe, Ill.: Free
    Press.
Wilcoxon, F.
    1949. Some rapid approximate statistical procedures. Rev. ed., 16
    pp., illus. New York: American Cyanamid Co.







APPENDIX TABLES 


Table 1.—Ratio of standard deviation to range for simple random samples of
size n from normal populations

[Table body not legible in the source.]

Abridged by permission of the author and publishers from table 2.2.2 of
Snedecor's Statistical Methods (ed. 5), Iowa State University Press.
































































Table 2.—Distribution of t

                                   Probability
  df     .5     .4     .3     .2     .1     .05     .02     .01     .001
   1   1.000  1.376  1.963  3.078  6.314  12.706  31.821  63.657  636.619
   2    .816  1.061  1.386  1.886  2.920   4.303   6.965   9.925   31.598
   3    .765   .978  1.250  1.638  2.353   3.182   4.541   5.841   12.941
   4    .741   .941  1.190  1.533  2.132   2.776   3.747   4.604    8.610
   5    .727   .920  1.156  1.476  2.015   2.571   3.365   4.032    6.859
   6    .718   .906  1.134  1.440  1.943   2.447   3.143   3.707    5.959
   7    .711   .896  1.119  1.415  1.895   2.365   2.998   3.499    5.405
   8    .706   .889  1.108  1.397  1.860   2.306   2.896   3.355    5.041
   9    .703   .883  1.100  1.383  1.833   2.262   2.821   3.250    4.781
  10    .700   .879  1.093  1.372  1.812   2.228   2.764   3.169    4.587
  11    .697   .876  1.088  1.363  1.796   2.201   2.718   3.106    4.437
  12    .695   .873  1.083  1.356  1.782   2.179   2.681   3.055    4.318
  13    .694   .870  1.079  1.350  1.771   2.160   2.650   3.012    4.221
  14    .692   .868  1.076  1.345  1.761   2.145   2.624   2.977    4.140
  15    .691   .866  1.074  1.341  1.753   2.131   2.602   2.947    4.073
  16    .690   .865  1.071  1.337  1.746   2.120   2.583   2.921    4.015
  17    .689   .863  1.069  1.333  1.740   2.110   2.567   2.898    3.965
  18    .688   .862  1.067  1.330  1.734   2.101   2.552   2.878    3.922
  19    .688   .861  1.066  1.328  1.729   2.093   2.539   2.861    3.883
  20    .687   .860  1.064  1.325  1.725   2.086   2.528   2.845    3.850
  21    .686   .859  1.063  1.323  1.721   2.080   2.518   2.831    3.819
  22    .686   .858  1.061  1.321  1.717   2.074   2.508   2.819    3.792
  23    .685   .858  1.060  1.319  1.714   2.069   2.500   2.807    3.767
  24    .685   .857  1.059  1.318  1.711   2.064   2.492   2.797    3.745
  25    .684   .856  1.058  1.316  1.708   2.060   2.485   2.787    3.725
  26    .684   .856  1.058  1.315  1.706   2.056   2.479   2.779    3.707
  27    .684   .855  1.057  1.314  1.703   2.052   2.473   2.771    3.690
  28    .683   .855  1.056  1.313  1.701   2.048   2.467   2.763    3.674
  29    .683   .854  1.055  1.311  1.699   2.045   2.462   2.756    3.659
  30    .683   .854  1.055  1.310  1.697   2.042   2.457   2.750    3.646
  40    .681   .851  1.050  1.303  1.684   2.021   2.423   2.704    3.551
  60    .679   .848  1.046  1.296  1.671   2.000   2.390   2.660    3.460
 120    .677   .845  1.041  1.289  1.658   1.980   2.358   2.617    3.373
  ∞     .674   .842  1.036  1.282  1.645   1.960   2.326   2.576    3.291

From table III of Fisher and Yates' Statistical Tables for Biological,
Agricultural, and Medical Research, Oliver and Boyd, Ltd., Edinburgh, by
permission of Dr. F. Yates, the literary executor of the late Professor
Sir Ronald A. Fisher, and the publishers.











Table 3.—Distribution of F ¹

Degrees of freedom for greater mean square

[Table body not legible in the source.]

¹ First line of figures in each pair is for the 5% level; second line in
each pair is for the 1% level.

Reproduced by permission of the author and publishers from table 10.5.3 of
Snedecor's Statistical Methods (ed. 5), Iowa State University Press.











Table 4.—Accumulative distribution of chi-square 

























































































Degrees of                        Probability of a greater value
freedom
  0.995 | 0.990 | 0.975 | 0.950 | 0.900 | 0.750 | 0.500 | 0.250 | 0.100 | 0.050 | 0.025 | 0.010 | 0.005
1 - M 0.02 | 0.10 | 0.45 |. 1.32 3.84 | 502| 6.63| 7.88 
2 X 06 | 0. 0,21 | 0.58 | 1.39) 2.77 5.99 | 7.38 | 9.21 | 10.60 
3 0.11 | 0,22 | 0.35 | 0.58| 1.21 | 2.37 | 411 7.81 | 9.35 | 11.34 | 12.84 
4 0,30 | 0.48 | 0.71 13.28 | 14.86 
5 9.55 | 0.83 | 1.15 15.09 | 16.75 
в 0.87 | 1.24 | 1.64 16.81 | 18,55 
ја 0.99 1.24 | 1.69] 2.17 18.48 | 20.28. 
8 1.34 | 1.65 | 2.18 | 273 20.09 | 21.96. 
9 1.73 | 2.09 | 2.70 | 343 21.07 | 23.59. 
10 2.10 | 2.56 | 3.25 | 3.94. 23.21 | 25.19. 
n 2.60 | 3.05 | 3.82 4.57 24,72 | 26.76 
12 3.07 | 3.57 | 4.40 | 5.23 26.22 30 
13 3.57 | 4.11 | 5.01 | 5.89 27.60 | 29.82. 
14 407 | 4.66 | 5.63 | 6.57 29.14 | 31.32 
15 4.00 | 5,23 | 6.27 7.26 30.58 | 32.80. 
16 5.14 | 5.81 | 6.91 7.06 J 32.00 | 34.27 
17 |520|641| 7:56 867 30:10 | 34:41 | 35: 
18 6.26 | 7.01 | 8.23 | 9.30 17.34 | 21.60 | 25.09. 31.53 | 34.81 
19 6.84 | 7.63 | 8.01 10.12 18.34 | 22.72 | 27.20 32.85 | 36.19 
20 7.43 | 8.26 | 9.68 10.85. 2841 34.17 | 87.57 
21 8.03 | 8.90 (10.28 (11.50. 20.62 35,48 | 38.03. 
22 64 | 9.54 10.08 12.34. 30.81 36.78 | 40.29 
23 9.26 110.20 11.69 13.09 32.01 38.08 | 41.64. 
24 9.89 10.86 12.40 113.85 38.20 30.46 | 42.98. 
25 10.52 11.52 13.12 (14.61 34.38 40.05 | 44.31 | 46.08 
26 — [1.16 12.20 13.84 115.38. 35.56 41.02 | 45,04 | 48.20 
27 111.81 12.88 114.57 1615 30.74 43.19 | 46.96 | 49.64 
28 (12.46 |13.56 15,31 (16.93. 27.34 | 32.62 | 37.92. 44.48 | 48.28 50.09 
29 13.12 14.26 110.05 |17.71 28.34 | 38.71 | 39.00 45.72 | 49.59 | 52,34 
30 13.79 114.95 |16.79 |18.49. 120,34 | 34.80 | 40.26 46.98 | 50.80 | 53.67 
40 (20.71 [22.16 [24.43 (20.51 30.34 | 45.02 | 51.80 59.34 | 63.69 | 66.77 
50 [27.00 29.71 42.36 34.76 49.33 | 56.33 | 63.17 1142 iu 7949 
60 — (35.53 37.48 40.48 43.19. 59.33 | 69.08 | 74.40. 83.30. „8 | 91.95 
TO 43.28 45.44 48.76 161.74. (69.33 | 77.58 | 85.53 95.02 |100.42 [104.22 
BO [51.17 53,54 [57.15 |60.t 179.33 | 88.13 | 96.58. 106.63 |112.33 1116,82 
90 159.20 01.75 65.65 |69.1: [80.33 | 98.04 (107.656. 118.14 124.12 1128.30 
100 [07:33 [70.06 |73.22 77.03 00:38 (109.14 118/50. 129.56 [135.81 14017 
Reproduced by permission of the author and publishers from table 1.14.1
of Snedecor's Statistical Methods (ed. 5), Iowa State University Press.
Permission has also been given by
Котета, 





























































Table 5.—Confidence intervals for binomial distribution

95-percent interval

Number      Size of sample, n                   Fraction    Size of sample
observed    10    15    20    30    50    100   observed    250    1000
0 o 31/0 22/0 170 120 0] 00 0 
1 0 45/0 32/0 250 17/0 11| 0 0 
2 3 562 40/1 341 20 м 02 0 
3 7 654 48 3 38 2 27 1 17. 03 1 
4 о 748 556 444 312 19 04 1 
5 (9 8112 62 9 496 353 22| 05 2 
6 |в 8816 6812 54 8 39 5 24| 06 2 
7 |85 9321 7815 5910 436 27| 07 3 
8 М4 9727 7919 6412 46 7 08 4 
9 155 10032 8423 6815 509 31| 09 4 
10 169 10038 8827 7317 530 34 10 5 
n 9232 7720 5012 30 11 5 
12 52 9636 8123 6013 38 12 6 
13 бо 9841 8525 6315 4] 3 7 
14 (68 1 28 6616 43 4 8 
15 78 10051 9131 6918 44| 15 9 
16 56 9434 7220 46) 16 9 
` 17 62 9737 7521 48 17 (0 
: 18 69 9940 7723 50 18 |п 
19 (75 10044 8025 3 19 12 
j 20 83 10047 8327 55 20 |13 
: 21 50 8528 57 21 14 
22 54 8830 59 22 14 
23 57 9032 61 23 15 
24 61 9234 63| 24 16 
25 65 9436 64 25 |17 
Е 26 69 аи 66 26 18 
b 27 73 39 68 27 19 37| 
28 78 9941 70) 38 |19 
29 83 10043 70 29 (0 
30 88 10045 73 50 21 
ЕН 47 16 m3 (2 
82 50 77 32 23 
33 52 70 33 ра 
34 54 80 3 [5 
35 56 s) 35 |26 
36 57 84 30 [27 
37 59 85 37 [28 
38 62 87 38 ра 
39 64 з 30 29 
40 66 90| 40 
41 69 оу 41 1 
42 71 93] 42 82 
43 73 94) 43 (|83 
44 16 95| 44 34 
45 78 97 45 85 
46 81 98) 46 36 
47 99 47 3 
48 86 100] 48 |38 
49 89 100] 49 |39 
50 93 100 » 40 
If p exceeds 0.50, read 1.00 - p = fraction observed and subtract each confidence limit from 100.














Table 5.—Confidence intervals for binomial distribution (continued) 












































99-percent interval

Number      Size of sample, n                   Fraction    Size of sample
observed    10    15    20    30    50    100   observed    250    1000
0 0 410 300 160 10 200 0 50 20 1 
1 0 540 400 220 14 OL 0 70 Е 02 
2 1 65 1 491 280 17 02 0 91 61 8 
3 4 742 56] 2 Е 1 20 9з 0 101 72 4 
4 8 815 634 36] 1 23 04 1122 93 6 
5 13 878 696 40| 2 26 05 1 132 103 7 
6 19 9212 74| 8 443 29| 06 2 M3 14 8 
7 26 9616 7911 484 31 07 2 103 135 9 
8 35 9921 8415 526 33 08 3 174 146 10 
9 46 10026 8818 55 7 36 09 3 185 157 12 
10 59 10031 9222 58 8 $ 10 4 196 168 13 
" 37 9526 6210 11 4 206 179 14 
12 44 9830 6511 43 A2 5 217 189 15 
13 51 9934 6812 45 A3 6 23 8 1910 16 
14 100/39 7114 47 14 6 24 9 2011 17 
15 70 10044 7415 4 15 7 20 9 2212 18 
16 49 1017 51 A6 8 2710 2313 19 
17 55 7918 17 9 2911 2414 20 
18 61 8220 48 9 3012 2515 21 
19 68 1 8421 57| 19 — [0 3113 2616 22 
20 (77 1 8623 59 20 11 3204 2717 23 
21 88/24 61 21 12 3315 2818 24 
22 9026 63 22 12 3416 3019 26 
23 9228 65 23 13 3517 3120 27 
24 94/29 67 24 14 36118 3221 28 
25 9631 69 25 15 3818 3322 29 
26 9733 71 26 16 3919 3422 30 
27 99/35 27 16 4020 3523 31 
28 72 100/37 74 28 17 4121 3624 32 
29 78 10039 76 29 18 4222 3725 33 
30 ВА 10041 77 30 19 4323 3826 34 
31 43 79) 81 20 4424 3927 35 
32 45 80 .32 21 4525 4028 36 
33 47 82 33 21 4626 4129 37 
34 49 83 34 22 4726 4230 38 
35 51 85 35 23 4827 4331 39 
36 53 86 36 24 4928 4432 40 
37 55 88 3 25 5029 4533 41 
38 57 89 38 26 5130 4634 42 
39 60 90 39 27 5231 4735 43 
40 62 92 40 28 5332 4836 44 
41 64 93 41 29 5433 5037 45 
42 67 94 42 29 5534 5138 46 
43 169 96 A3 30 5035 5239 47 
44 71 97 44 31 5736 5340 48 
45 74 98 45 32 5837 5441 49 
46 77 99 46 33 5938 5542 50 
47 80 99| 47 34 6039 5543 51 
48 83 100) 48 35 6140 5644 52 
49 |86 100) 49 36 6241 5745 53 
50 90 100, -50 87 6342 5846 54 















If p exceeds 0.50, read 1.00 - p = fraction observed and subtract each
confidence limit from 100.

Reproduced by permission of the author and publishers from table 1.3.1 of
Snedecor's Statistical Methods (ed. 5), Iowa State University Press.
















Table 6.—Arc sine transformation












































%       0       1       2       3       4       5       6       7       8       9
fl 00 |o 0571 081 0.99 1.28) 1.40) 152] 162| 172 
" 01 | 181 1901 199) 207 222| 229| 236| 243 250 

02 | 256| 263| 269| 275 287| 292| 298| 303| 309 

03 | 314 319] 3241 329| 3341 339) 344 3.49) 353| 3.58 

04 | 363 367| 372) 376 3801 385| 389) 3.03| 3.97| 401 
1 05 | 405| 409| 413 417| 421| 425) 429| 433) 437| 440 

06 | 444| 448| 452 455) 459) 462) 406 | 469) 473| 476 

07 | 480| 483| 487) 400) 493| 497| 500| 503| 507| 5.10 

08 | 513| 516| 520) 523| 526| 5291 532| 535 541 

09 | 544| 547| 550| 5.53 5.56) 559 5.62 565 571 

1 574| 602| 629| 6.55| 6.80) 704| 727| 749 792 

2 813] 833 | &53| 872 891 9.10| 928| 9.46 : 

3 9.08 | 10.14 | 10.31 | 10.47 | 10.63 | 10.78 | 10.94 | 11.09 

4 1154 | 11.68 | 11.83 | 11.97 | 12.11 | 12.25 | 12.39 12.52 

5 12.92 | 18.05 | 18.18 | 13.31 | 13.44 | 13.56 | 13.69 | 13.81 
3 6 14.18 14:30 | 14.42 | 14:54 | 14.65 | 14:77 | 14.89 | 15.00 

7 15.34 | 15.45 | 15.56 | 15.68 | 15.79 | 15.89 | 16.00 | 16.11 

8 18.43 | 16.54 | 16.64 | 16.74 | 16.85 | 16.95 | 17.05 | 17.16 

9 17.46 | 17.56 | 17.66 | 17.76 | 17.85 | 17.95 | 18.05 | 18.15 

10 18.44 | 18.53 | 18.63 | 18.72 | 18.81 | 18.91 | 19.00 | 19.09 

n 19.37 | 19.46 | 19.55 | 19.64 | 19.73 | 19.82 | 19.91 | 20.00 

12 20.27 | 20.86 | 20.44 | 20.53 | 20.62 | 20.70 | 20.79 | 20.88 

13 2113 | 21.22 | 21.30 | 21.39 | 2147 | 21.56 | 2104 | 21.72 

м 21:97 | 22.06 | 22.14 | 22.22 | 22.30 | 22.38 | 22.46 | 22.55 

15 22.79 | 22.87 | 22.05 | 23.03 | 23.11 | 28.19 | 23.26 | 28.34 

18 23.58 | 23.66 | 23.73 | 23.81 | 23.80 | 23.97 | 24.04 | 24.12 

17 24.35 | 24.43 | 24.50 | 24.58 | 24.65 | 24:73 2488 

18 25.10 | 25.18 | 25.25 | 25.33 | 25.40 | 25.48 | 25.55 | 25.62 

19 25.84 | 25.92 | 25.00 | 26.06 | 26.13 | 26.21 | 26.28 | 26.35 

20 26.56 | 26.64 | 26.71 | 26.78 | 26.85 | 26.92 | 26.99 | 27.06 

21 27.28 | 27.35 | 27.42 | 27.49 | 27.56 | 27.03 | 27.69 | 27.70 1 

22 97 | 28.04 | 28.11 | 28.18 | 28.25 | 28.32 | 28.38 | 28.45 28.59 

23 66 | 28.73 | 28.79 | 28.86 | 28.93 | 29.00 | 29.06 | 29.13 29.27 

24 29.33 | 29.40 | 20.47 | 29.53 | 29.60 | 20.67 | 29.73 | 29.80 29.93 

25 30:00 | 30.07 | 30.13 | 30.20 | 30.26 | 30.33 | 30.40 | 30.46 | 30.53 | 30.59. 

26 30.66 | 30.72 | 30.79 | 30.55 | 30.92 | 30.98 | 31.05 | 31.11 | 81.18 | 31.24 

27 31:31 | 31.37 | 31.44 | 31:50 | 31.56 | 31.68 | 31.89 | 31.76 | 31.82 | 31.88 

28 31.95 | 32-01 | 32.08 | 32.14 | 32.20 | 32.27 | 32.33 | 32.39 | 32.46 | 32.52. 

29 32.58 | 32.65 | 82.71 | 32.77 | 32.83 | 32.90 | 32-96 | 33.02 | 33.09 | 33.15 

30 38.21 | 33.27 | 33.34 | 33.40 | 33.46 | 33.52 | 33.58 | 33.05 | 33.71 | 33.77 

31 33.83 | 33.80 | 33.96 | 34.02 | 34.08 | 34-14 | 34.20 | 34.27 | 34.33 | 34.39 

32 34.45 | 34.51 | 34.57 | 34.63 | 34.70 | 34.76 | 34.82 | 34:88 | 34.94 | 35.00 

33 35.06 | 35.12 | 35.18 | 35.24 | 35.30 | 35.97 | 35.43 | 35.49 | 35.55 | 35.01 

34 35.67 | 38.73 | 85.79 | 35.85 | 35.91 | 35.97 | 36.03 | 36.09 | 36.15 36.21 

35 36.27 | 36.33 | 36.39 | 36.45 | 36.51 | 36.57 | 36.63 | 36.69 | 36.75 | 36.81 

36 36:87 | 36.93 | 36.99 | 37.05 | 37.11 | 37.17 | 37.23 | 37.29 | 37.35 | 37. 

37 37.47 | 37.52 | 37.58 | 37.64 | 37.70 | 37.76 | 37.82 | 37.88 | 37.94 

38 38:06 | 38.12 | 38.17 | 38.23 | 38.29 | 38.35 | 38.41 | 38.47 | 38.53 

39 38.65 | 38.70 | 38.76 | 38.82 | 38.88 | 38.94 | 39.00 | 39.06 | 39.11 

40 30.23 | 39.29 | 30.35 | 39.41 | 39.47 | 39.52 | 39.58 | 39.64 | 39.70 

а 39:82 | 39.87 | 30.03 | 39.99 | 40.05 | 40.11 | 40.16 | 40.22 | 40.28, 

42 40.40 | 40.46 | 40.51 | 40.57 | 40.63 | 40.69 | 40. .80 | 40.86 

43 40.98 | 41.03 | 41.09 | 41.15 | 41.21 | 41.27 | 41; .38 | 4144 

44 41:55 | 41.61 | 41.67 | 41.73 | 41.78 | 4184 | 41.90 | 41.96 | 42.02 

45 42.13 | 42.19 | 42.25 | 42:30 | 42.38 | 42.42 | 42.48 | 42.53 | 42.59 

46 42.71 | 42:76 | 42.82 | 42.88 | 42.94 | 42,99 | 43.05 | 43.11 | 43.17 

47 43.28 | 43.34 | 43.39 | 48.45 | 43.51 | 43.57 | 43.62 | 43.68 | 43.74 

48 43.85 | 43.91 | 43.07 | 44.03 | 44.08 | 44.14 | 44.20 | 44.25 | 44.31 

49 4443 | 4448 | 44.54 44.60 | 44.66 | 4.71 | 44.77 | 44.83 | 44.89 

50 45:00 | 45.06 | 45.11 | 45.17 | 45.23 | 45.29 | 45.34 | 45.40 | 45.46 

51 45.57 | 45.63 | 45.69 | 45.75 | 45.80 45,92 | 45.97 | 46.03 

52 46,15 | 46.20 | 46.26 | 46.32 | 46.38 | 46.43 | 46.49 | 46.55 | 464 

53 46.72 | 46.78 46.89 | 46.95 | 47.01 | 47.06 | 47.12 

54 47.29 | 47.35 | 47.41 | 47.47 | 47.52 | 47.58 | 47.64 | 47.70 








Table 6.—Arc sine transformation (continued)






























%       0       1       2       3       4       5       6       7       8       9
55 47.87 | 47.93 | 47.98 | 48.04. 48.16 | 48.22 | 48.27 Гаазе 48.39 
56 48.45 48.56 | 48.62 48.73 | 48.79 | 48.85 | 48,01 | 48.07 
57 49.02 | 49.08 | 49.14 | 49. 49.31 | 49.37 | 49.43 | 49.49 | 40.54 
58 49.60 | 49.66 | 49.72 | 49.78. 49.89 | 49.95 | 50.01 | 50.07 | 50.13 
59 50.18 | 50.24 | 50.30 | 50.36. 50.48 | 50.53 | 50.59 | 60.65 | 50.71 
60 50.77 | 50.83 | 50.89 | 50.94. 51.06 | 51.12 | 51.18 | 61.24 | 51.30. 
6 51.35 | 51.41 | 51.47 | 51.53 51.66 | 51.71 | 51.77 | 51.83 | 51.88. 
62 51.94 | 52.00 | 52.06 | 52.12 52.24 30 | 52.36 | 52.42 | 52.48. 
63 52.58 | 52.59 | 52,65 | 52.71 52.83 | 52.89 | 52.95 | 53.01 | 53.07 
64 53.13 | 53.19 | 53.25 | 53.31 53.43 | 53.49 | 53.55 | 53.61 | 53.67 
65 53.73 | 53.79 | 58.85 | 53.91 54.03 | 54.09 | 54.15 | 54.21 | 54.27 
66 54.33 | 54.39 | 54.45 | 54.51 54.63 | 54.70 | 54.76 | 54.82 | 54.88. 
67 54.94 | 55.00 | 55.06 | 55.12 55.24 | 55.30 | 55.37 | 55.43 | 55.49. 
68 55.55 | 55.61 | 55.67 | 55.73 55.86 | 55.92 | 55.08 | 56.04 | 56.11 
69 56.17 | 56.23 | 56.29 | 56.35 56.48 | 56.54 | 56.60 | 56.66 | 56.73. 
70 56.79 | 56.85 | 56.91 | 56.98 57.10 | 57.17 | 57.23 | 57.29 | 57.35 
71 57.42 | 57.48 | 57.54 | 57.61 Ó 57.80 | 57.86 | 57.92 | 57.99. 
72 58.05 | 58.12 | 68.18 | 58.24 58.44 | 58.50 | 58.56 | 68.63. 
78 58.60 | 58.76 | 58.82 | 58.89 59.08 | 59.15 | 60.21 | 50.28 
74 59.34 | 59.41 | 60.47 | 59.54 59.74 | 59.80 59.87 | 59.03 
75 60.00 | 60.07 | 60.13 | 60.20 60.40 | 60.47 | 60.53 | 60.60 
76 60,67 | 60.73 | 60.80 | 60.87 4 А 61.21 | 61.27 
77 61.34 | 61.41 | 61.48 | 61.55. 61.89 | 61.96 
78 62.03 | 62.10 | 62.17 | 62.24 62.58 | 62.65 
79 62.72 | 62,80 | 62.87 | 62.04 63.29 | 63.36 
80 63.44 | 63.51 | 63.58 | 63.65 64.01 | 64.08 
81 64.16 | 64.23 -30 | 64.38 64.75 | 64.82 
82 64.90 | 64.97 | 65.05 | 65.12 65.50 | 65.57 
83 65.65 | 65.73 | 65.80 | 65.88. 66.27 | 66.34 
84 66.42 | 66.50 | 66.68 | 66.66 1 67.05 | 67.13 
85 67.21 | 67.29 | 67.37 | 67.45 б 67.86 | 67.04 
86 08.03 | 68,11 | 68.19 | 68.28. od 68.70 | 68.78. 
87 68.87 | 68.95 | 69.04 | 69.12. E 5 69.56 | 69.64 
88 69.73 | 60.82 | 60.01 | 70.00 L | 70.45 | 70.54 
80 70.63 | 70.72 | 70.81 | 70.91 a Б 71.37 | 71.47 
90 71.56 | 71.06 71.76 | 71.85 2 72. 72.34 | 7244 
91 72.54 | 72.64 | 72.74 | 72.84 3. 34 73.36 | 73.46. 
92 73.57 | 73.68 | 73.78 | 73.89 s „82 | 74,44 | 74.55 
93 74.66 | 74.77 | 74.88 | 75.00 .3 5 75.58 | 75.70 
94 15.82 | 75.94 | 76.06 | 76.19 4 j| 76.82 | 76.95 
95 77.08 | 77.21 | 77.34 | 77.48. 1 i 78.17 | 78.32 
78.46 | 78.61 | 78.76 | 78.91 E .53 | 79.69 | 79.86 
97 80.02 | 80.19 | 80.37 | 80.54 Е 4 81.47 | 81.67 
98 81.87 | 82.08 | 82.29 | 82.51 : Я 83.71 | 83.98. 
99.0 84.26 | 84.29 | 84.32 | 84.35 3 x 84.50 | 84.53 
99.1 84.56 | 84.59 | 84. 84.65 84.74 | 84.77 | 84.80 | 84.84 
99.2 84.87 | 84.90 | 84.93 | 84.97 85.07 | 85.10 | 85.13 | 85.17 
99.3 85.20 | 85.24 | 85.27 | 85.31 85.41 | 85.45 | 85.48 | 85.52 
99.4 85.56 | 85.60 | 85.63 | 85.67 85.79 | 85.83 | 85.87 | 85.91 
99.5 85.95 | 85.09 | 86.03 | 86.07 86.20 | 86.24 | 86.28 | 86.33 
99.6 86.37 | 86.42 | 86.47 | 86.51 86.66 | 86.71 | 86.76 | 86.81 
99.7 86.86 | 86.91 | 86.97 | 87.02 87.19 | 87.25 | 87.31 | 87.37 
99.8 87.44 | 87.) 87.57 | 87.64 87.86 | 87.93 | 88.01 | 88.10 
99.9 88.19 | 88.28 | 88.38. 48 $8.85 | 89.01 | 89.19 | 89.43 
100.0 90.00 - = — - - = = 
































Reproduced by permission of the author and publishers from table 11.12.1 of
Snedecor's Statistical Methods (ed. 5), Iowa State University Press. Permission has also
been granted by the original author, Dr. C. I. Bliss of the Connecticut Agricultural
Experiment Station.
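
Entries of table 6 can be recomputed when a percentage falls between
tabulated values. The sketch below assumes the usual definition of the
transformation, the angle in degrees whose sine is the square root of the
proportion; the function name is illustrative.

```python
import math


def arcsine_degrees(percent):
    """Return the arc sine transformation of a percentage, in degrees:
    angle = arcsin(sqrt(percent / 100))."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percent must lie between 0 and 100")
    return round(math.degrees(math.asin(math.sqrt(percent / 100.0))), 2)


for pct in (0.1, 25, 50, 100):
    print(f"{pct:>6}%  ->  {arcsine_degrees(pct):.2f} degrees")
```

The results for 25, 50, and 100 percent (30.00, 45.00, and 90.00 degrees)
agree with the corresponding entries in the table.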













Table 7.—Significance of correlation coefficients

[CORRELATION COEFFICIENTS AT THE 5% AND 1% LEVELS OF SIGNIFICANCE]

Degrees of                      Degrees of
freedom       5%       1%       freedom       5%       1%
     1       .997    1.000          24       .388     .496
     2       .950     .990          25       .381     .487
     3       .878     .959          26       .374     .478
     4       .811     .917          27       .367     .470
     5       .754     .874          28       .361     .463
     6       .707     .834          29       .355     .456
     7       .666     .798          30       .349     .449
     8       .632     .765          35       .325     .418
     9       .602     .735          40       .304     .393
    10       .576     .708          45       .288     .372
    11       .553     .684          50       .273     .354
    12       .532     .661          60       .250     .325
    13       .514     .641          70       .232     .302
    14       .497     .623          80       .217     .283
    15       .482     .606          90       .205     .267
    16       .468     .590         100       .195     .254
    17       .456     .575         125       .174     .228
    18       .444     .561         150       .159     .208
    19       .433     .549         200       .138     .181
    20       .423     .537         300       .113     .148
    21       .413     .526         400       .098     .128
    22       .404     .515         500       .088     .115
    23       .396     .505       1,000       .062     .081

Reproduced by permission of the author and publishers from
table 7.6.1 of Snedecor's Statistical Methods (ed. 5), Iowa State
University Press. Permission has also been granted by the
literary executor of the late Professor Sir Ronald A. Fisher and
Oliver and Boyd, Ltd., publishers, for the portion of the table
taken from table V.A. in Statistical Methods for
Research Workers.
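
The entries of table 7 are tied to the t distribution of table 2. For a
simple correlation, the smallest significant r can be obtained from the
tabular t at the same degrees of freedom through the standard relation
r = t / sqrt(t^2 + df). The sketch below illustrates that relation; the
function name is an assumption made for the example.

```python
import math


def critical_r(t_value, df):
    """Smallest significant simple correlation coefficient for a given
    tabular t and degrees of freedom: r = t / sqrt(t^2 + df)."""
    return t_value / math.sqrt(t_value ** 2 + df)


# Tabular t values from table 2 for 10 df (5% and 1%) and 20 df (5%).
print(round(critical_r(2.228, 10), 3))
print(round(critical_r(3.169, 10), 3))
print(round(critical_r(2.086, 20), 3))
```

For 10 degrees of freedom these values round to the .576 and .708 shown
in the table, and for 20 degrees of freedom at the 5% level to .423.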













